Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI+PGI errors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-28 08:19:03

On May 27, 2008, at 11:47 AM, Jim Kusznir wrote:

> I have updated to OpenMPI 1.2.6 and had the user rerun his jobs. He's
> getting similar output:
> [root_at_aeolus logs]# more 2047.aeolus.OU
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is /mnt/pvfs2/patton/data/chem/aa1
> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...

FWIW: this message ("mpirun: killing job...") *only* displays if
mpirun catches a SIGINT or SIGTERM.

This seems quite fishy; I seem to recall that Torque sends a SIGTERM
about 30 seconds before the job's wallclock time runs out. Can you do a
stupid test? Replace the "mpirun..." with some other command --
perhaps a short C program that outputs a line every N seconds or
something, just so that you can see continued progress. See if it
dies (or catches a SIGINT or SIGTERM) in about the same amount of time
that mpirun typically dies.

Jeff Squyres
Cisco Systems