On May 27, 2008, at 11:47 AM, Jim Kusznir wrote:
> I have updated to OpenMPI 1.2.6 and had the user rerun his jobs. He's
> getting similar output:
> [root@aeolus logs]# more 2047.aeolus.OU
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is /mnt/pvfs2/patton/data/chem/aa1
> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...
FWIW: this message ("mpirun: killing job...") *only* displays if
mpirun catches a SIGINT or SIGTERM.
This seems quite fishy; I seem to recall that Torque sends a SIGTERM
about 30 seconds before the job's wallclock limit runs out. Can you
do a simple test? Replace the "mpirun..." line in the job script with
some other command, perhaps a short C program that outputs a line
every N seconds, just so that you can see continued progress. See if
it dies (or catches a SIGINT or SIGTERM) after about the same amount
of time that mpirun typically runs before dying.