Open MPI User's Mailing List Archives

Subject: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
From: Rahul Nabar (rpnabar_at_[hidden])
Date: 2009-03-31 19:50:22

I've a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster:

Each node has 8 cpus.

If I log in to a node and run it like so, the job works:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

I know that the --hostfile directive is not needed in the latest
torque-OpenMPI jobs.

I also tried including:

mpirun -np 6 --hosts node17,node17,node17,node17,node17,node17

Still does not work.

What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?

${DACAPOEXE_PAR} refers to a Fortran executable for the
computational chemistry code DACAPO.

What's the difference between running a job on a node via mpirun
directly and submitting it via Torque? Shouldn't both be transparent to the
Fortran calls? I am assuming I don't have to dig into the Fortran code.
Any debug tips?
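
One way to narrow down the difference between the two cases might be to
capture the shell environment in both and diff the results. The following
is only a sketch, assuming a POSIX-style shell on the node; the function
name dump_env and the output file names are made up for illustration.
PBS_NODEFILE is the variable Torque sets inside a job.

```shell
#!/bin/sh
# Hypothetical helper: print settings that can differ between an
# interactive shell on a node and a job started by Torque.
dump_env() {
    echo "== host =="
    uname -n
    echo "== limits =="
    # Resource limits (stack size, locked memory, etc.) of this shell.
    ulimit -a
    echo "== PBS node file =="
    # Torque sets PBS_NODEFILE inside a job; it is normally unset otherwise.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "${PBS_NODEFILE}" ]; then
        cat "${PBS_NODEFILE}"
    else
        echo "(not set)"
    fi
    echo "== environment =="
    env | sort
}
```

Running "dump_env > interactive.txt" in a login shell on the node and
"dump_env > torque.txt" from inside the job script, just before the mpirun
line, then "diff interactive.txt torque.txt", should show whether limits,
paths, or the node list differ between the two cases.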