Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-31 21:56:19


It is very hard to debug the problem with so little information. We
regularly run OMPI jobs on Torque without issue.

Are you getting an allocation from somewhere for the nodes? If so, are
you using Moab to get it? Do you have a $PBS_NODEFILE in your
environment?
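
One way to answer the allocation questions is to print what the job actually sees from inside the Torque job script. The following is only a sketch: the function name describe_allocation is made up for illustration, and it assumes the usual Torque behaviour of writing one line per allocated slot into the file named by $PBS_NODEFILE.

```shell
# Hypothetical helper (not part of Open MPI or Torque): summarize the
# allocation visible to the current shell.
# Assumption: $PBS_NODEFILE lists one hostname per allocated slot.
describe_allocation() {
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set"
        return 1
    fi
    if [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE is set but not readable: $PBS_NODEFILE"
        return 1
    fi
    # Count how many slots were granted on each host.
    sort "$PBS_NODEFILE" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}
```

Calling describe_allocation just before the mpirun line in the job script, and again in an interactive shell on the node, should show whether the two environments differ.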

I have no idea why your processes are crashing when run via Torque -
are you sure that the processes themselves crash? Are they segfaulting
- if so, can you use gdb to find out where?
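
To tell a real crash from a non-zero exit, it can help to decode the exit status of one process run directly (not through mpirun). This is a rough sketch, assuming the common shell convention that a status above 128 means "killed by signal (status - 128)"; the function name describe_exit_status is hypothetical, and the name of the core file depends on the system's settings.

```shell
# Hypothetical helper: turn a shell exit status into a readable message.
# Assumption: status > 128 means the process died from signal (status - 128).
describe_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $(kill -l "$((status - 128))")"
    else
        echo "exited with status $status"
    fi
}

# Possible use in a job script (paths and core file name are placeholders):
#   ulimit -c unlimited                       # allow core files to be written
#   ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}      # run one process directly
#   describe_exit_status $?
#   gdb ${EXE_PATH}/${DACAPOEXE_PAR} core     # then type "bt" for a backtrace
```

A status of 139 would point at a segmentation fault (signal 11), in which case the gdb backtrace is the next thing to look at.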

More information would be most helpful - what we really need is
specified here: http://www.open-mpi.org/community/help/

Thanks
Ralph

On Mar 31, 2009, at 5:50 PM, Rahul Nabar wrote:

> I've a strange OpenMPI/Torque problem while trying to run a job on our
> Opteron-SC-1435 based cluster:
>
> Each node has 8 cpus.
>
> If I go to a node and run it like so, the job works:
>
> mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
>
> If I submit the same job through PBS/Torque, it starts running but the
> individual processes keep crashing:
>
> mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
>
> I know that the --hostfile directive is not needed in the latest
> torque-OpenMPI jobs.
>
> I also tried including:
>
> mpirun -np 6 --hosts node17,node17,node17,node17,node17,node17
> ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
>
> Still does not work.
>
> What could be going wrong? Are there other things I need to worry
> about when PBS steps in? Any tips?
>
> The ${DACAPOEXE_PAR} refers to a Fortran executable for the
> computational chemistry code DACAPO.
>
> What's the difference between running a job on a node via mpirun
> directly vs. submitting it via Torque? Shouldn't both be transparent to
> the Fortran calls? I am assuming I don't have to dig into the Fortran
> code. Any debug tips?
>
> Thanks!
>
> --
> Rahul