It is very hard to debug the problem with so little information. We regularly run OMPI jobs on Torque without issue.

Are you getting an allocation from somewhere for the nodes? If so, are you using Moab to get it? Do you have a $PBS_NODEFILE in your environment?
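A quick diagnostic you could drop into the job script (a sketch - adjust to your setup) to see whether Torque is actually handing the job a nodefile:

```shell
# Check whether Torque exported a nodefile into this job's environment
if [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "nodefile: $PBS_NODEFILE"
    sort "$PBS_NODEFILE" | uniq -c     # one line per host, count = slots allocated
else
    echo "no PBS_NODEFILE in this environment"
fi
```

Run interactively (outside a Torque job) this prints "no PBS_NODEFILE in this environment"; inside a job it should list each allocated host with its slot count.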

I have no idea why your processes are crashing when run via Torque - are you sure that the processes themselves crash? Are they segfaulting - if so, can you use gdb to find out where?
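If they are segfaulting, one common approach (a sketch - the executable name here is a placeholder for your own) is to enable core dumps in the job script and inspect the core post-mortem:

```shell
ulimit -c unlimited          # allow core files before launching mpirun
mpirun -np 6 ./my_app args   # re-run; a segfault should now leave a core file
gdb ./my_app core.12345      # then type: bt   (backtrace shows where it crashed)
```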

More information would be most helpful - what we really need is specified here: http://www.open-mpi.org/community/help/

Thanks
Ralph


On Mar 31, 2009, at 5:50 PM, Rahul Nabar wrote:

I've a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster:

Each node has 8 cpus.

If I go to a node and run like so, then the job works:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

I know that the --hostfile option is not needed with recent
Torque-aware Open MPI builds.
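For reference, the submission script is essentially this (a minimal sketch with placeholder names, not the full script):

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=6
cd $PBS_O_WORKDIR
# No --hostfile: Open MPI built with Torque support reads the allocation itself
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
```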

I also tried including:

mpirun -np 6 --host node17,node17,node17,node17,node17,node17
${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

Still does not work.

What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?

The ${DACAPOEXE_PAR} refers to a Fortran executable for the
computational chemistry code DACAPO.

What's the difference between submitting a job on a node via mpirun
directly vs. via Torque? Shouldn't both be transparent to the
Fortran calls? I am assuming I don't have to dig into the Fortran code.
Any debug tips?

Thanks!

--
Rahul
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users