Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI job initializing problem
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-03 17:51:24

On Mar 3, 2014, at 1:48 PM, Beichuan Yan <beichuan.yan_at_[hidden]> wrote:

> 1. After sysadmin installed libibverbs-devel package, I build Open MPI 1.7.4 successfully with the command:
> ./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-verbs=/hafs_x86_64/devel/usr --with-verbs-libdir=/hafs_x86_64/devel/usr/lib64
> 2. Then I rebuild and run my job in hybrid MPI/OPENMP mode: each compute node only runs 1 process (this 1 process runs 16 OPENMP threads), it can get initialized and run well each time with $TCP setting as follows, this is great:
> TCP="--mca btl_tcp_if_include"
> mpirun $TCP -np 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt

If you're using the native verbs API, you don't need that TCP clause.

Also, if you're running in a PBS job, you don't need the -hostfile clause. And if you're running one process per core in the allocated PBS job, you can skip the -np clause, too. You should be able to run with:

    mpirun ./paraEllip3d input.txt

If you want one process per server, then

    mpirun -np <num_servers> --map-by node ./paraEllip3d input.txt
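
One possible way to fill in <num_servers> inside the job script is to count the distinct host names in $PBS_NODEFILE. The following is only a sketch: the count_nodes helper is hypothetical, it assumes the node file holds one host name per line (possibly repeated once per slot), and the value 16 for OMP_NUM_THREADS is taken from the hybrid run described in the quoted message. The -x option asks mpirun to export that variable to the launched processes.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct hosts listed in a PBS node file.
# Assumes one host name per line, possibly repeated once per slot.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Launch only when actually inside a PBS job; otherwise do nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    num_servers=$(count_nodes "$PBS_NODEFILE")
    # 16 OpenMP threads per process, as in the quoted hybrid run (assumption).
    export OMP_NUM_THREADS=16
    mpirun -np "$num_servers" --map-by node -x OMP_NUM_THREADS ./paraEllip3d input.txt
fi
```

Outside a PBS job the guard makes the script a no-op, so the helper can be checked on its own against a hand-written node file.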

> 3. Then I test pure-MPI mode: OPENMP is turned off, and each compute node runs 16 processes (clearly shared-memory of MPI is used). Four combinations of "TMPDIR" and "TCP" are tested:
> case 1:
> #export TMPDIR=/home/yanb/tmp
> TCP="--mca btl_tcp_if_include"
> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
> output:
> Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
> End Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
> -bash: line 1: 448597 Terminated /var/spool/PBS/mom_priv/jobs/602244.service12.SC
> Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014
> Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
> End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014

It looks like you have two general cases:

1. The job fails for no apparent reason (like above), or
2. The job complains that your TMPDIR is on a shared filesystem


I think the real issue, then, is to figure out why your jobs are failing with no output.

Is there anything in the stderr output?
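
If it is unclear where stderr ends up, one option is to capture it explicitly in the job script. The sketch below is an assumption about how that could be done, not something taken from the job in question: run_logged is a hypothetical wrapper that writes stdout and stderr to separate files and reports the exit status, and the file prefix built from $PBS_O_WORKDIR and $PBS_JOBID is only a suggestion.

```shell
# Hypothetical wrapper: run a command, save stdout and stderr to separate
# files named <prefix>.out and <prefix>.err, and report the exit status.
run_logged() {
    prefix=$1
    shift
    "$@" >"$prefix.out" 2>"$prefix.err"
    status=$?
    echo "exit status: $status; stderr bytes: $(wc -c <"$prefix.err" | tr -d ' ')"
    return "$status"
}

# Inside the job script, the launch could then be wrapped like this:
#   run_logged "$PBS_O_WORKDIR/paraEllip3d.$PBS_JOBID" mpirun ./paraEllip3d input.txt
```

An empty .err file together with a non-zero exit status would suggest that the job is being killed from outside rather than failing inside mpirun.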

Jeff Squyres