
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI+PGI errors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-23 14:54:35


This may be a dumb question, but is there a chance that his job is
running beyond 30 minutes, and PBS/Torque/whatever is killing it?
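
For example (just a sketch, and the job id below is made up), comparing
the requested and used walltime with Torque's own tools should confirm
or rule that out:

    # 12345 = hypothetical job id; qstat and tracejob ship with Torque
    qstat -f 12345 | grep -i walltime   # Resource_List.walltime vs. resources_used.walltime
    tracejob 12345                      # server/MOM log entries around the time of the kill

If that turns out to be it, raising the "#PBS -l walltime" limit in the
submit script is the fix.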

On May 20, 2008, at 4:23 PM, Jim Kusznir wrote:

> Hello all:
>
> I've got a user on our ROCKS 4.3 cluster that's having some strange
> errors. I have other users using the cluster without any such errors
> reported, but this user also runs this code on other clusters without
> any problems, so I'm not really sure where the problem lies. They are
> getting logs with the following:
>
> --------
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is /mnt/pvfs2/patton/data/chem/aa1
> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...
>
> Terminated
> --------------------------------------------------------------------------
> WARNING: mpirun is in the process of killing a job, but has detected
> an
> interruption (probably control-C).
>
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed). Hit control-C again within 1
> second if you really want to kill mpirun immediately.
> --------------------------------------------------------------------------
> mpirun noticed that job rank 0 with PID 14126 on node
> compute-0-23.local exited on signal 15 (Terminated).
> [compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> ---------
>
> The job was submitted with:
> ---------
> #!/bin/csh
> ##PBS -N for.chem.aa1
> #PBS -l nodes=2
> #PBS -l walltime=0:30:00
> #PBS -m n
> #PBS -j oe
> #PBS -o /home/patton/logs
> #PBS -e /home/patton/logs
> #PBS -V
> #
> # ------ set case specific parameters
> # and setup directory structure
> #
> set time=000001_000100
> #
> set case=aa1
> set type=chem
> #
> # ---- set up directories
> #
> set SCRATCH=/mnt/pvfs2/patton
> mkdir -p $SCRATCH
>
> set datadir=$SCRATCH/data/$type/$case
> set execdir=$SCRATCH/exec/$type/$case
> set archdir=$SCRATCH/data/$type/$case
> set les_output=les.$type.$case.out.$time
>
> set compdir=$HOME/compile/$type/$case
> #set compdir=$HOME/compile/free/aa1
>
> echo 'data directory is ' $datadir
> echo 'exec directory is ' $execdir
> echo 'arch directory is ' $archdir
>
> mkdir -p $datadir
> mkdir -p $execdir
> #
> cd $execdir
> rm -fr *
> cp $compdir/* .
> #
> # ------- build machine file for code to read setup
> #
> # ------------ set imachine=0 for NCAR IBM SP : bluevista
> # imachine=1 for NCAR IBM SP : bluesky
> # imachine=2 for ASC SGI Altix : eagle
> # imachine=3 for ERDC Cray XT3 : sapphire
> # imachine=4 for ASC HP XC : falcon
> # imachine=5 for NERSC Cray XT4 : franklin
> # imachine=6 for WSU Cluster : aeolus
> #
> set imachine=6
> set store_files=1
> set OMP_NUM_THREADS=1
> #
> echo $imachine > mach.file
> echo $store_files >> mach.file
> echo $datadir >> mach.file
> echo $archdir >> mach.file
> #
> # ---- submit the run
> #
> mpirun -n 2 ./lesmpi.a > $les_output
> #
> # ------ clean up
> #
> mv $execdir/u.* $datadir
> mv $execdir/p.* $datadir
> mv $execdir/his.* $datadir
> cp $execdir/$les_output $datadir
> #
> echo 'job ended '
> exit
> #
> -------------
> (it's possible this particular script doesn't match this particular
> error... The user ran the job, and this is what I assembled from
> conversations with him. In any case, it's representative of the jobs
> he's running, and they're all returning similar errors.)
>
> The error occurs at varying time steps in the runs, and if run without
> MPI, it runs fine to completion.
>
> Here's the version info:
>
> [kusznir_at_aeolus ~]$ rpm -qa |grep pgi
> pgilinux86-64-707-1
> openmpi-pgi-docs-1.2.4-1
> openmpi-pgi-devel-1.2.4-1
> roll-pgi-usersguide-4.3-0
> openmpi-pgi-runtime-1.2.4-1
> mpich-ethernet-pgi-1.2.7p1-1
> pgi-rocks-4.3-0
>
> The OpenMPI rpms were built from the supplied spec (or nearly so,
> anyway) with the following command line:
> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
>   --define 'install_in_opt 1' \
>   --define 'install_modulefile 1' \
>   --define 'modules_rpm_name environment-modules' \
>   --define 'build_all_in_one_rpm 0' \
>   --define 'configure_options --with-tm=/opt/torque' \
>   --define '_name openmpi-pgi' \
>   --define 'use_default_rpm_opt_flags 0' \
>   openmpi.spec
>
> Any suggestions?
>
> Thanks!
>
> --Jim
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems