Simplest solution: add -bynode to your mpirun command line.
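For example, applied to the launch line in your script below (this is just your command with the one flag added; everything else unchanged):

```shell
# -bynode maps ranks round-robin across nodes instead of filling the
# first node before moving to the second, so with 8 ranks and two
# 4-core nodes each node should get 4 ranks.
mpirun -bynode -am ft-enable-cr -machinefile hostfile ex5mpi testData
```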
On Feb 20, 2011, at 10:50 PM, DOHERTY, Greg wrote:
> In order to be able to checkpoint openmpi jobs with blcr, we have
> configured openmpi as follows
> ./configure --prefix=/data1/packages/openmpi/1.5.1-blcr-without-tm
> --disable-openib-connectx-xrc --disable-openib-rdmacm --with-ft=cr
> --enable-mpi-threads --enable-ft-thread --with-blcr=/usr
> --with-blcr-libdir=/usr/include --without-tm
> When used in conjunction with torque2.5.3, we are able to start the
> following job with 8 cores on one node, but if we try to start the same
> job with 4 cores on each of two nodes, the job starts 4 cores on the
> primary node, but not the remaining 4 cores on the second node.
> $ cat PBStest
> #PBS -c enabled
> #PBS -l walltime=25:00:00
> #PBS -l nodes=2:ppn=4
> #PBS -m ae
> #PBS -M gdz_at_[hidden]
> #PBS -N Prob8
> #PBS -r n
> #PBS -q blcrq
> source /etc/profile.d/00-modules.sh
> module load mpi/openmpi_1.5-blcr-without-tm
> NN=`cat $PBS_NODEFILE | wc -l`
> cd $PBS_O_WORKDIR
> cat $PBS_NODEFILE > hostfile
> cat $PBS_NODEFILE
> echo "NN = $NN "
> which mpirun
> cd $PBS_O_WORKDIR
> mpirun -am ft-enable-cr -machinefile hostfile ex5mpi testData
> The hostfile correctly lists the primary node 4 times, and then the
> second node 4 times.
> When openmpi is built --with-tm, which is the default unless --without-tm
> is specified, the job correctly starts with its 8 cores spread across the
> two nodes.
> blcr needs cr_mpirun to start the job without torque support in order to
> checkpoint the mpi job correctly.
> My question is whether it is possible for the script above to be
> modified in order to start on multiple nodes if openmpi has been built
> with --without-tm and, if so, what needs to be added to or deleted from
> the script.
> I have tried -mca plm ^tm with openmpi built --with-tm, which also does
> not start the second 4 mpi ranks.
> Any suggestions gratefully accepted.
> Greg Doherty
> users mailing list