Open MPI User's Mailing List Archives

Subject: [OMPI users] --without-tm [SEC=UNCLASSIFIED]
From: DOHERTY, Greg (gdz_at_[hidden])
Date: 2011-02-21 00:50:37

To checkpoint Open MPI jobs with BLCR, we have configured Open MPI as
follows:

./configure --prefix=/data1/packages/openmpi/1.5.1-blcr-without-tm \
    --disable-openib-connectx-xrc --disable-openib-rdmacm --with-ft=cr \
    --enable-mpi-threads --enable-ft-thread --with-blcr=/usr \
    --with-blcr-libdir=/usr/include --without-tm

With Torque 2.5.3, we can start the following job on 8 cores of a single
node. If we instead request 4 cores on each of two nodes, the job starts
4 processes on the primary node but never starts the remaining 4 on the
second node.

$ cat PBStest
#PBS -c enabled
#PBS -l walltime=25:00:00
#PBS -l nodes=2:ppn=4
#PBS -m ae
#PBS -M gdz_at_[hidden]
#PBS -N Prob8
#PBS -r n
#PBS -q blcrq
source /etc/profile.d/
module load mpi/openmpi_1.5-blcr-without-tm
NN=`cat $PBS_NODEFILE | wc -l`
cat $PBS_NODEFILE > hostfile
echo "NN = $NN "
which mpirun
mpirun -am ft-enable-cr -machinefile hostfile ex5mpi testData

The hostfile correctly lists the primary node 4 times, and then the
second node 4 times.
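
One quick way to confirm a hostfile layout like the one described above
is to count the entries per node. The following is a minimal sketch, not
part of the original job script: the function name slots_per_host and
the node names nodeA and nodeB are hypothetical, chosen only for
illustration.

```shell
# Hypothetical helper: print "<hostname> <count>" for each host listed in
# a machinefile, in order of first appearance. Reads the named file(s),
# or standard input when no file is given.
slots_per_host() {
    awk 'NF {
             if (!($1 in n)) order[++k] = $1   # remember first-seen order
             n[$1]++                           # one slot per listed line
         }
         END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }' "$@"
}

# Example input with made-up node names, standing in for $PBS_NODEFILE
# as produced by "nodes=2:ppn=4".
printf '%s\n' nodeA nodeA nodeA nodeA nodeB nodeB nodeB nodeB | slots_per_host
```

In the job script itself, the same function could be called as
"slots_per_host hostfile" right after the hostfile is written, so the
per-node counts appear in the job output next to the NN value.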

When Open MPI is built --with-tm, which is the default if --without-tm
is not specified, the job correctly starts on the 8 cores spread across
the two nodes.

To checkpoint the MPI job correctly, BLCR needs cr_mpirun to start the
job without Torque support.

My question is whether the script above can be modified to start on
multiple nodes when Open MPI has been built with --without-tm and, if
so, what needs to be added to or deleted from the script.

I have also tried -mca plm ^tm with Open MPI built --with-tm, which
likewise fails to start the second 4 MPI ranks.

Any suggestions gratefully accepted.
Greg Doherty