Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi/pbsdsh/Torque problem
From: Reuti (reuti_at_[hidden])
Date: 2011-04-03 11:12:16


On 03.04.2011 at 16:56, Ralph Castain wrote:

> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote:
>
>> Let me expand on this slightly (in response to Ralph Castain's posting
>> -- I had digest mode set). As currently constructed a shellscript in
>> Wien2k (www.wien2k.at) launches a series of tasks using
>>
>> ($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >> .time1_$loop &
>>
>> where the standard setting for "remote" is "ssh", remotemachine is the
>> appropriate host, "t" is "time" and "ttt" is a concatenation of
>> commands, for instance when using 2 cores on one node for Task1, 2
>> cores on 2 nodes for Task2 and 2 cores on 1 node for Task3
>>
>> Task1:
>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1
>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>> Task2:
>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2
>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
>> Task3:
>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3
>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def
>>
>> This is a stable script, works under SGI, linux, mvapich and many
>> others using ssh or rsh (although I've never myself used it with rsh).
>> It is general purpose, i.e. will work to run just 1 task on 8x8
>> nodes/cores or 8 parallel tasks on 8 nodes all with 8 cores or any
>> scatter of nodes/cores.
>>
>> According to some, ssh is becoming obsolete within supercomputers and
>> the "replacement" is pbsdsh at least under Torque.
>
> Somebody is playing an April Fools joke on you. The majority of supercomputers use ssh as their sole launch mechanism, and I have seen no
> indication that anyone intends to change that situation. That said, Torque is certainly popular and a good environment.

I operate my Linux clusters without `ssh` or `rsh`. I use SGE's `qrsh` instead. How will you get a tight integration with correct accounting and job control otherwise? This might be different when you have an AIX or NEC SX machine, as they provide additional control mechanisms.

-- Reuti

>> Getting pbsdsh to work is
>> certainly not as simple as the documentation I've seen suggests. To get it to
>> even partially work I am using for "remote" a script "pbsh" which
>> writes an executable bash file, $PBS_O_WORKDIR/.tmp_$1, that exports
>> HOME, PATH, LD_LIBRARY_PATH etc. as well as the PBS environment
>> variables listed at the bottom of
>> http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml plus PBS_NODEFILE,
>> followed by the relevant command, and then runs
>>
>> pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1 "
>>
>> This works fine so long as Task2 does not have 2 nodes (probably 3 as
>> well, I've not tested this). If it does there is a communications
>> failure with nothing launched on the 2nd node of Task2.
>>
>> I'm including the script below, as maybe there are some other
>> environmental variables needed or some should not be there in order to
>> properly rebuild the environment so things will work. (And yes, I know
>> there should be tests to see if the variables are set first and so
>> forth and this is not so clean, this is just an initial version.)
>
> By providing all those PBS-related envars to OMPI, you are causing OMPI to think it should use Torque as the launch mechanism. Unfortunately, that won't work in this scenario.
>
> When you start a Torque job (get an allocation etc.), Torque puts you on one of the allocated nodes and creates a "sister mom" on that node. This is your job's "master node". All Torque-based launches must take place from that location.
>
> So when you pbsdsh to another node and attempt to execute mpirun with those envars set, mpirun attempts to contact the local "sister mom" so it can order the launch of any daemons on other nodes....only the "sister mom" isn't there! So the connection fails and mpirun aborts.
>
> If mpirun is -only- launching procs on the local node, then it doesn't need to launch another daemon (as mpirun will host the local procs itself), and so it doesn't attempt to contact the "sister mom" and the comm failure doesn't occur.
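
Ralph's point that the PBS-related variables steer mpirun toward the Torque launcher can be illustrated with a small bash sketch. The helper name `without_pbs_env` is hypothetical and not part of Open MPI or Torque. It is an assumption that, with these variables absent, mpirun would fall back to its rsh/ssh launcher, which only helps where ssh or rsh between nodes is permitted and which gives up the tight integration Reuti mentions above.

```shell
#!/bin/bash
# Hypothetical helper: run a command with every PBS_* variable removed
# from its environment. The calling shell's own variables are untouched
# because the unset happens inside a subshell.
without_pbs_env() {
    (
        local name
        # "${!PBS_@}" expands to the names of all variables starting with PBS_.
        for name in "${!PBS_@}"; do
            unset "$name"
        done
        "$@"
    )
}

# Possible use (assumption: ssh/rsh launch is allowed on the cluster):
# without_pbs_env mpirun -np 4 -machinefile .machine2 ./app
```
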
>
> What I still don't understand is why you are trying to do it this way. Why not just run
>
> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>
> where machineN contains the names of the nodes where you want the MPI apps to execute? mpirun will only execute apps on those nodes, so this accomplishes the same thing as your script - only with a lot less pain.
>
> Your script would just contain a sequence of these commands, each with its number of procs and machinefile as required.
>
> Actually, it would be pretty much identical to the script I use when doing scaling tests...
>
>
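
The "sequence of these commands" that Ralph describes could look roughly like the bash sketch below, run from the job's master node so that all launches originate where the "sister mom" lives. The function name `run_tasks` and the `MPIRUN` and `EXE` variables are hypothetical; the `.machineN` and `lapw1Q_N.def` naming and the binary path are taken from the examples quoted in this thread. Running the tasks in the background mirrors the `&` in the Wien2k launch line and is an assumption about what is wanted, not something Ralph specified.

```shell
#!/bin/bash
# Sketch: one mpirun per task, each restricted to its own machinefile.
# MPIRUN and EXE can be overridden from the environment.
MPIRUN=${MPIRUN:-mpirun}
EXE=${EXE:-/home/lma712/src/Virgin_10.1/lapw1Q_mpi}

run_tasks() {
    # Arguments: number of processes for task 1, task 2, ...
    local i=0 np pid status=0
    local pids=()
    for np in "$@"; do
        i=$((i + 1))
        # Start task i in the background on the nodes listed in .machine$i.
        "$MPIRUN" -x LD_LIBRARY_PATH -x PATH -np "$np" \
            -machinefile ".machine$i" "$EXE" "lapw1Q_$i.def" &
        pids+=("$!")
    done
    # Wait for every task; report failure if any of them failed.
    for pid in "${pids[@]}"; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example from the thread: 2, 4 and 2 processes for Task1, Task2, Task3.
# run_tasks 2 4 2
```

Wrapping the whole call in `time`, or each mpirun individually, would recover the timing output that the original launch line collects.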
>>
>> ----------
>> # Script to replace ssh by pbsdsh
>> # Beta version, April 2011, L. D. Marks
>> #
>> # Remove old file -- needed !
>> rm -f $PBS_O_WORKDIR/.tmp_$1
>>
>> # Create a script that exports the environment we have
>> # This may not be enough
>> echo '#!/bin/bash' > $PBS_O_WORKDIR/.tmp_$1
>> echo source $HOME/.bashrc >> $PBS_O_WORKDIR/.tmp_$1
>> echo cd $PBS_O_WORKDIR >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PATH=$PBS_O_PATH >> $PBS_O_WORKDIR/.tmp_$1
>> echo export TMPDIR=$TMPDIR >> $PBS_O_WORKDIR/.tmp_$1
>> echo export SCRATCH=$SCRATCH >> $PBS_O_WORKDIR/.tmp_$1
>> echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH >> $PBS_O_WORKDIR/.tmp_$1
>>
>> # Openmpi needs to have this defined, even if we don't use it
>> echo export PBS_NODEFILE=$PBS_NODEFILE >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_ENVIRONMENT=$PBS_ENVIRONMENT >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_JOBCOOKIE=$PBS_JOBCOOKIE >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_JOBID=$PBS_JOBID >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_JOBNAME=$PBS_JOBNAME >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_MOMPORT=$PBS_MOMPORT >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_NODENUM=$PBS_NODENUM >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_HOME=$PBS_O_HOME >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_HOST=$PBS_O_HOST >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_LANG=$PBS_O_LANG >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_LOGNAME=$PBS_O_LOGNAME >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_MAIL=$PBS_O_MAIL >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_PATH=$PBS_O_PATH >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_QUEUE=$PBS_O_QUEUE >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_SHELL=$PBS_O_SHELL >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_WORKDIR=$PBS_O_WORKDIR >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_QUEUE=$PBS_QUEUE >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_TASKNUM=$PBS_TASKNUM >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_VNODENUM=$PBS_VNODENUM >> $PBS_O_WORKDIR/.tmp_$1
>>
>> # Now the command we want to run
>> echo $2 >> $PBS_O_WORKDIR/.tmp_$1
>>
>> # Make it executable
>> chmod a+x $PBS_O_WORKDIR/.tmp_$1
>>
>> pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1 "
>>
>> #Cleanup if needed (commented out for debugging)
>> #rm $PBS_O_WORKDIR/.tmp_$1
>>
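
The script's author notes that it should test whether each variable is set. A minimal bash sketch of that idea follows; the function name `write_env_script` is hypothetical. It writes the shebang, exports only the variables that are actually set (quoting their values with `printf %q`), and appends the command. Which variables to pass is left to the caller; given Ralph's remark above about the PBS-related variables, the usage example passes only the non-PBS ones, which is an assumption about what is appropriate.

```shell
#!/bin/bash
# write_env_script FILE COMMAND VAR...
# Writes FILE as a bash script that exports each named variable that is
# currently set, then runs COMMAND.
write_env_script() {
    local file=$1 cmd=$2 name
    shift 2
    {
        printf '%s\n' '#!/bin/bash'
        for name in "$@"; do
            # ${!name+x} is non-empty only if the variable named by $name is set.
            if [ -n "${!name+x}" ]; then
                # %q quotes the value so spaces and metacharacters survive.
                printf 'export %s=%q\n' "$name" "${!name}"
            fi
        done
        printf '%s\n' "$cmd"
    } > "$file"
    chmod u+x "$file"
}

# Possible use inside the pbsh wrapper:
# write_env_script "$PBS_O_WORKDIR/.tmp_$1" "$2" PATH LD_LIBRARY_PATH TMPDIR SCRATCH
```
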
>>
>> On Sat, Apr 2, 2011 at 9:36 PM, Laurence Marks <L-marks_at_[hidden]> wrote:
>>> I have a problem which may or may not be openmpi, but since this list
>>> was useful before with a race condition I am posting.
>>>
>>> I am trying to use pbsdsh as a ssh replacement, pushed by sysadmins as
>>> Torque does not know about ssh tasks launched from a task. In a simple
>>> case, a script launches three mpi tasks in parallel,
>>>
>>> Task1: NodeA
>>> Task2: NodeB and NodeC
>>> Task3: NodeD
>>>
>>> (some cores on each, all handled correctly). Reproducibly (but with
>>> different nodes and numbers of cores) Task1 and Task3 work fine; for Task2 the
>>> mpi task starts on NodeB but nothing starts on NodeC, and it appears that
>>> NodeC does not communicate. It does not have to be this layout; it could be
>>>
>>> Task1: NodeA NodeB
>>> Task2: NodeC NodeD
>>>
>>> Here NodeC will start and it looks as if NodeD never starts anything.
>>> I've also run it with 4 Tasks (1,3,4 work) and if Task2 only uses one
>>> Node (number of cores do not matter) it is fine.
>>>
>>> --
>>> Laurence Marks
>>> Department of Materials Science and Engineering
>>> MSE Rm 2036 Cook Hall
>>> 2220 N Campus Drive
>>> Northwestern University
>>> Evanston, IL 60208, USA
>>> Tel: (847) 491-3996 Fax: (847) 491-7820
>>> email: L-marks at northwestern dot edu
>>> Web: www.numis.northwestern.edu
>>> Chair, Commission on Electron Crystallography of IUCR
>>> www.numis.northwestern.edu/
>>> Research is to see what everybody else has seen, and to think what
>>> nobody else has thought
>>> Albert Szent-Györgyi
>>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users