
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
From: Rahul Nabar (rpnabar_at_[hidden])
Date: 2009-03-31 22:36:30


2009/3/31 Ralph Castain <rhc_at_[hidden]>:
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.

Another small thing I noticed; not sure if it is relevant.

When the job starts running there is an orted process. The arguments to
this process differ slightly depending on whether the job was submitted
via Torque or launched directly on a node. Could this be an issue? Just
a thought.

The essential difference seems to be that the Torque run has the
--no-daemonize option whereas the direct run has a --set-sid option. I
got these via ps after submitting an interactive Torque job.

Do these matter at all? Full ps output snippets are reproduced below.
Some other numbers also differ on closer inspection, but that might be
by design.

###############via Torque; segfaults. ##################
rpnabar 11287 0.1 0.0 24680 1828 ? Ss 21:04 0:00 orted
--no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename node17 --universe rpnabar_at_node17:default-universe-11286
--nsreplica "0.0.0;tcp://10.0.0.17:45839" --gprreplica
"0.0.0;tcp://10.0.0.17:45839"
######################################################

##############direct MPI run; this works OK################
rpnabar 11026 0.0 0.0 24676 1712 ? Ss 20:52 0:00 orted
--bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
node17 --universe rpnabar_at_node17:default-universe-11024 --nsreplica
"0.0.0;tcp://10.0.0.17:34716" --gprreplica
"0.0.0;tcp://10.0.0.17:34716" --set-sid
##########################################################