Open MPI User's Mailing List Archives


Subject: [OMPI users] OpenMPI+PGI errors
From: Jim Kusznir (jkusznir_at_[hidden])
Date: 2008-05-20 16:23:54

Hello all:

I've got a user on our ROCKS 4.3 cluster who's hitting some strange
errors. Other users run on the cluster without any such errors being
reported, and this user also runs this code on other clusters without
any problems, so I'm not really sure where the problem lies. They are
getting logs with the following:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is /mnt/pvfs2/patton/data/chem/aa1
exec directory is /mnt/pvfs2/patton/exec/chem/aa1
arch directory is /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...

WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed). Hit control-C again within 1
second if you really want to kill mpirun immediately.
mpirun noticed that job rank 0 with PID 14126 on node
compute-0-23.local exited on signal 15 (Terminated).
[compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
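As I read it, "signal 15" just means rank 0 was SIGTERMed by mpirun's own cleanup rather than crashing on its own. A quick shell illustration of the 128+signal exit-status convention (nothing Open MPI-specific here, just standard POSIX shell behavior):

```shell
# Signal 15 is SIGTERM: the process was killed, not an application
# crash. POSIX shells report death-by-signal N as exit status 128+N,
# so a SIGTERMed process shows up as 128+15 = 143.
sh -c 'kill -TERM $$'
echo "exit status: $?"   # prints: exit status: 143
```

The readv "Connection reset by peer" line is then just the other daemon's side of the same teardown.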

The job was submitted with:
##PBS -N for.chem.aa1
#PBS -l nodes=2
#PBS -l walltime=0:30:00
#PBS -m n
#PBS -j oe
#PBS -o /home/patton/logs
#PBS -e /home/patton/logs
# ------ set case specific parameters
# and setup directory structure
set time=000001_000100
set case=aa1
set type=chem
# ---- set up directories
set SCRATCH=/mnt/pvfs2/patton
mkdir -p $SCRATCH

set datadir=$SCRATCH/data/$type/$case
set execdir=$SCRATCH/exec/$type/$case
set archdir=$SCRATCH/data/$type/$case
set les_output=les.$type.$case.out.$time

set compdir=$HOME/compile/$type/$case
#set compdir=$HOME/compile/free/aa1

echo 'data directory is ' $datadir
echo 'exec directory is ' $execdir
echo 'arch directory is ' $archdir

mkdir -p $datadir
mkdir -p $execdir
cd $execdir
rm -fr *
cp $compdir/* .
# ------- build machine file for code to read setup
# ------------ set imachine=0 for NCAR IBM SP : bluevista
# imachine=1 for NCAR IBM SP : bluesky
# imachine=2 for ASC SGI Altix : eagle
# imachine=3 for ERDC Cray XT3 : sapphire
# imachine=4 for ASC HP XC : falcon
# imachine=5 for NERSC Cray XT4 : franklin
# imachine=6 for WSU Cluster : aeolus
set imachine=6
set store_files=1
echo $imachine > mach.file
echo $store_files >> mach.file
echo $datadir >> mach.file
echo $archdir >> mach.file
# ---- submit the run
mpirun -n 2 ./lesmpi.a > $les_output
# ------ clean up
mv $execdir/u.* $datadir
mv $execdir/p.* $datadir
mv $execdir/his.* $datadir
cp $execdir/$les_output $datadir
echo 'job ended '
(It's possible this particular script doesn't match this particular
error... The user ran the job, and this is what I assembled from
conversations with him. In any case, it's representative of the jobs
he's running, and they're all returning similar errors.)
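For clarity, here is what the four-line mach.file assembled by the script ends up containing for this case (a plain sh sketch with the values from the script substituted in; the LES code reads these lines at startup):

```shell
# Rebuild mach.file the same way the job script does for case aa1.
# Note archdir and datadir are the same path in this script.
datadir=/mnt/pvfs2/patton/data/chem/aa1
archdir=/mnt/pvfs2/patton/data/chem/aa1
echo 6 > mach.file            # imachine=6: WSU Cluster "aeolus"
echo 1 >> mach.file           # store_files=1
echo $datadir >> mach.file
echo $archdir >> mach.file
cat mach.file
```

So the code should see "6", "1", and the two directory paths, in that order.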

The error occurs at varying time steps from run to run, and when the
code is run without MPI, it runs fine to completion.

Here's the version info:

[kusznir_at_aeolus ~]$ rpm -qa |grep pgi

The OpenMPI rpms were built from the supplied spec (or nearly so,
anyway) with the following command line:
CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
  --define 'install_in_opt 1' \
  --define 'install_modulefile 1' \
  --define 'modules_rpm_name environment-modules' \
  --define 'build_all_in_one_rpm 0' \
  --define 'configure_options --with-tm=/opt/torque' \
  --define '_name openmpi-pgi' \
  --define 'use_default_rpm_opt_flags 0' \
  openmpi.s

Any suggestions?