
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi/pbsdsh/Torque problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-04-03 12:41:53


On Apr 3, 2011, at 9:34 AM, Laurence Marks wrote:

> On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote:
>>
>>> Let me expand on this slightly (in response to Ralph Castain's posting
>>> -- I had digest mode set). As currently constructed a shellscript in
>>> Wien2k (www.wien2k.at) launches a series of tasks using
>>>
>>> ($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >> .time1_$loop &
>>>
>>> where the standard setting for "remote" is "ssh", remotemachine is the
>>> appropriate host, "t" is "time" and "ttt" is a concatenation of
>>> commands, for instance when using 2 cores on one node for Task1, 2
>>> cores on 2 nodes for Task2 and 2 cores on 1 node for Task3
>>>
>>> Task1:
>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1
>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>> Task2:
>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2
>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
>>> Task3:
>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3
>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def
>>>
>>> This is a stable script, works under SGI, linux, mvapich and many
>>> others using ssh or rsh (although I've never myself used it with rsh).
>>> It is general purpose, i.e. it will run just 1 task on 8x8
>>> nodes/cores, or 8 parallel tasks on 8 nodes each with 8 cores, or any
>>> scatter of nodes/cores.
>>>
>>> According to some, ssh is becoming obsolete within supercomputers and
>>> the "replacement" is pbsdsh at least under Torque.
>>
>> Somebody is playing an April Fools joke on you. The majority of supercomputers use ssh as their sole launch mechanism, and I have seen no indication that anyone intends to change that situation. That said, Torque is certainly popular and a good environment.
>
> Alas, it is not an April fools joke, to quote from
> http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml
> "pbsdsh can be used as a replacement for an ssh or rsh command which
> invokes a user command on a worker machine. Some applications expect
> the availability of rsh or ssh in order to invoke parts of the
> computation on the sister worker nodes of the main worker. Using
> pbsdsh instead is necessary on this cluster because direct use of ssh
> or rsh is not allowed, for accounting and security reasons."

Ah, but that is an administrative decision by a single organization - not the global supercomputer industry. :-)

>
> I am not using that computer. A scenario I have come across is that
> when an msub job is killed because it has exceeded its walltime,
> MPI tasks spawned by ssh may not be terminated because (so I am told)
> Torque does not know about them.

Not true with OMPI. Torque will kill mpirun, which will in turn cause all MPI procs to die. Yes, it's true that Torque won't know about the MPI procs itself. However, OMPI is designed such that termination of mpirun by the resource manager will cause all apps to die.
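
For example, a minimal batch script along the following lines (the resource request and application name are only placeholders) keeps mpirun as a direct child of the job script, so a walltime kill or qdel propagates to everything:

#!/bin/bash
#PBS -l nodes=2:ppn=4,walltime=01:00:00
cd $PBS_O_WORKDIR
# mpirun runs under Torque's control here; when Torque kills the job,
# mpirun dies and OMPI terminates every MPI proc it launched
mpirun -np 8 ./my_mpi_app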

>
>>
>>> Getting pbsdsh to work is
>>> certainly not as simple as the documentation I've seen suggests. To get
>>> it to even partially work, I am using for "remote" a script "pbsh" which
>>> creates an executable bash file, writing HOME, PATH, LD_LIBRARY_PATH etc.
>>> as well as the PBS environment variables listed at the bottom of
>>> http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml plus PBS_NODEFILE to
>>> a file $PBS_O_WORKDIR/.tmp_$1, followed by the relevant command, and
>>> then runs
>>>
>>> pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1 "
>>>
>>> This works fine so long as Task2 does not span 2 nodes (probably 3 as
>>> well, though I've not tested this). If it does, there is a communication
>>> failure, with nothing launched on the 2nd node of Task2.
>>>
>>> I'm including the script below, as maybe there are some other
>>> environment variables needed, or some that should not be there, in order
>>> to properly rebuild the environment so things will work. (And yes, I know
>>> there should be tests to see if the variables are set first and so
>>> forth, and this is not so clean; this is just an initial version.)
>>
>> By providing all those PBS-related envars to OMPI, you are causing OMPI to think it should use Torque as the launch mechanism. Unfortunately, that won't work in this scenario.
>>
>> When you start a Torque job (get an allocation etc.), Torque puts you on one of the allocated nodes and creates a "sister mom" on that node. This is your job's "master node". All Torque-based launches must take place from that location.
>>
>> So when you pbsdsh to another node and attempt to execute mpirun with those envars set, mpirun attempts to contact the local "sister mom" so it can order the launch of any daemons on other nodes....only the "sister mom" isn't there! So the connection fails and mpirun aborts.
>>
>> If mpirun is -only- launching procs on the local node, then it doesn't need to launch another daemon (as mpirun will host the local procs itself), and so it doesn't attempt to contact the "sister mom" and the comm failure doesn't occur.
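>>
>> (If you really must execute mpirun from one of the other nodes, a possible workaround - just a sketch, I haven't tried it with your wrapper - is to keep those PBS envars out of mpirun's environment, or explicitly select the ssh/rsh launcher:
>>
>> env -u PBS_ENVIRONMENT -u PBS_JOBID \
>>     mpirun -mca plm rsh -np 2 -machinefile .machine1 \
>>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>
>> though of course that puts you right back to launching daemons over ssh.)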
>>
>> What I still don't understand is why you are trying to do it this way. Why not just run
>>
>> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>
>> where machineN contains the names of the nodes where you want the MPI apps to execute? mpirun will only execute apps on those nodes, so this accomplishes the same thing as your script - only with a lot less pain.
>>
>> Your script would just contain a sequence of these commands, each with its number of procs and machinefile as required.
>>
>> Actually, it would be pretty much identical to the script I use when doing scaling tests...
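>>
>> A minimal sketch of such a driver script, assuming the same machinefiles and binary as above (adjust -np and the machinefile per task):
>>
>> #!/bin/bash
>> cd $PBS_O_WORKDIR
>> # each mpirun only starts procs on the nodes named in its machinefile
>> time mpirun -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1 \
>>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>> time mpirun -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2 \
>>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
>> time mpirun -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3 \
>>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def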
>
> This can be done, and in fact I have a job running where the non-MPI
> work is launched using pbsdsh but all the MPI is launched locally, and
> this seems to be working. This may be a viable, general solution, but
> there could also be issues with SCRATCH and other directories. In
> principle there could also be issues with launching N MPI tasks from
> one node. The executables I am using work well with very scattered
> cores, e.g. using procs=64 or procs=256, but (at least with the system
> I am using) I may only end up with 1 or 2 cores on the local node where
> the job starts. (I've asked the sysadmin people to find a way to do
> this better, e.g. to prefer launching from the node with the largest
> number of cores available, which I think can be done, but they do not
> have this set up as yet.)

Running multiple mpiruns in parallel on the same nodes definitely won't work, at least with OMPI. You would have to either ensure that each mpirun is launching apps on unique nodes (i.e., no two mpiruns have apps on the same node), or execute them serially.

That is a pretty common constraint, not just one on OMPI - but I can't speak as definitively about the other MPIs out there.
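
In other words (a sketch, not something I've tested against your script), the tasks can still go in parallel as long as their machinefiles name disjoint nodes; otherwise run them one after another:

# OK in parallel only if .machine1/.machine2/.machine3 share no nodes
mpirun -np 2 -machinefile .machine1 /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def &
mpirun -np 4 -machinefile .machine2 /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def &
mpirun -np 2 -machinefile .machine3 /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def &
wait   # block until all three mpiruns have finished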

>>
>>
>>>
>>> ----------
>>> # Script to replace ssh by pbsdsh
>>> # Beta version, April 2011, L. D. Marks
>>> #
>>> # Remove old file -- needed !
>>> rm -f $PBS_O_WORKDIR/.tmp_$1
>>>
>>> # Create a script that exports the environment we have
>>> # This may not be enough
>>> echo '#!/bin/bash' > $PBS_O_WORKDIR/.tmp_$1
>>> echo source $HOME/.bashrc >> $PBS_O_WORKDIR/.tmp_$1
>>> echo cd $PBS_O_WORKDIR >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PATH=$PBS_O_PATH >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export TMPDIR=$TMPDIR >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export SCRATCH=$SCRATCH >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH >> $PBS_O_WORKDIR/.tmp_$1
>>>
>>> # Openmpi needs to have this defined, even if we don't use it
>>> echo export PBS_NODEFILE=$PBS_NODEFILE >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_ENVIRONMENT=$PBS_ENVIRONMENT >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_JOBCOOKIE=$PBS_JOBCOOKIE >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_JOBID=$PBS_JOBID >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_JOBNAME=$PBS_JOBNAME >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_MOMPORT=$PBS_MOMPORT >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_NODENUM=$PBS_NODENUM >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_HOME=$PBS_O_HOME >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_HOST=$PBS_O_HOST >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_LANG=$PBS_O_LANG >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_LOGNAME=$PBS_O_LOGNAME >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_MAIL=$PBS_O_MAIL >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_PATH=$PBS_O_PATH >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_QUEUE=$PBS_O_QUEUE >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_SHELL=$PBS_O_SHELL >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_O_WORKDIR=$PBS_O_WORKDIR >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_QUEUE=$PBS_QUEUE >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_TASKNUM=$PBS_TASKNUM >> $PBS_O_WORKDIR/.tmp_$1
>>> echo export PBS_VNODENUM=$PBS_VNODENUM >> $PBS_O_WORKDIR/.tmp_$1
>>>
>>> # Now the command we want to run
>>> echo $2 >> $PBS_O_WORKDIR/.tmp_$1
>>>
>>> # Make it executable
>>> chmod a+x $PBS_O_WORKDIR/.tmp_$1
>>>
>>> pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1 "
>>>
>>> #Cleanup if needed (commented out for debugging)
>>> #rm $PBS_O_WORKDIR/.tmp_$1
>>>
>>>
>>> On Sat, Apr 2, 2011 at 9:36 PM, Laurence Marks <L-marks_at_[hidden]> wrote:
>>>> I have a problem which may or may not be an Open MPI issue, but since
>>>> this list was helpful before with a race condition, I am posting here.
>>>>
>>>> I am trying to use pbsdsh as an ssh replacement, pushed by sysadmins
>>>> because Torque does not know about ssh tasks launched from within a job.
>>>> In a simple case, a script launches three MPI tasks in parallel,
>>>>
>>>> Task1: NodeA
>>>> Task2: NodeB and NodeC
>>>> Task3: NodeD
>>>>
>>>> (some cores on each, all handled correctly). Reproducibly (but with
>>>> different nodes and numbers of cores), Task1 and Task3 work fine; for
>>>> Task2 the MPI task starts on NodeB but nothing starts on NodeC, and it
>>>> appears that NodeC does not communicate. It does not have to be this
>>>> exact layout; it could be
>>>>
>>>> Task1: NodeA NodeB
>>>> Task2: NodeC NodeD
>>>>
>>>> Here NodeC starts, but it looks as if NodeD never starts anything.
>>>> I've also run it with 4 tasks (1, 3, and 4 work), and if Task2 only
>>>> uses one node (the number of cores does not matter) it is fine.
>>>>
>>>> --
>>>> Laurence Marks
>>>> Department of Materials Science and Engineering
>>>> MSE Rm 2036 Cook Hall
>>>> 2220 N Campus Drive
>>>> Northwestern University
>>>> Evanston, IL 60208, USA
>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
>>>> email: L-marks at northwestern dot edu
>>>> Web: www.numis.northwestern.edu
>>>> Chair, Commission on Electron Crystallography of IUCR
>>>> www.numis.northwestern.edu/
>>>> Research is to see what everybody else has seen, and to think what
>>>> nobody else has thought
>>>> Albert Szent-Györgi
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users