Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi/pbsdsh/Torque problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-04-03 18:20:07


On Apr 3, 2011, at 4:08 PM, Reuti wrote:

> On 03.04.2011 at 23:59, David Singleton wrote:
>
>> On 04/04/2011 12:56 AM, Ralph Castain wrote:
>>>
>>> What I still don't understand is why you are trying to do it this way. Why not just run
>>>
>>> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>>
>>> where machineN contains the names of the nodes where you want the MPI apps to execute? mpirun will only execute apps on those nodes, so this accomplishes the same thing as your script - only with a lot less pain.
>>>
>>> Your script would just contain a sequence of these commands, each with its number of procs and machinefile as required.
>>>
>>
>> Maybe I missed why this suggestion of dropping ssh/pbsdsh altogether
>> was not feasible? Just use mpirun (with its great TM support!) to distribute
>> the MPI jobs.
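
Exactly. Inside the Torque job, the script then just becomes a sequence of mpirun calls, one per machinefile. A rough sketch of what I mean (the machinefile names and the second .def file are only placeholders - adjust the counts and paths to your case):

  #!/bin/bash
  # split the node list Torque granted into one machinefile per parallel step
  head -2 $PBS_NODEFILE > .machine1
  tail -2 $PBS_NODEFILE > .machine2

  # one mpirun per step; each runs only on the nodes named in its machinefile
  mpirun -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1 \
    /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def &
  mpirun -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine2 \
    /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def &
  wait

No ssh needed, and with Open MPI's TM support the remote daemons stay under Torque's control.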
>
> Wien2k has a two-stage startup, e.g. for 16 slots:
>
> a) start `ssh` four times in the background to reach some of the granted nodes
> b) on each of those nodes, use `mpirun` to start processes on the remaining nodes, 3 per call

Sounds to me like someone should fix wien2k... :-)
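
If I understand the description correctly, that two-stage startup amounts to roughly the following (my own sketch - the host names, machinefile names and .def file are invented):

  # stage a) ssh in the background to 4 of the granted nodes
  for host in node01 node02 node03 node04; do
    # stage b) each remote shell then runs its own mpirun for 3 more processes
    ssh $host "cd $PWD && mpirun -np 3 -machinefile .machine_$host ./lapw1Q_mpi case.def" &
  done
  wait

So the actual MPI launches happen behind Torque's back, which is where the control problem you list below comes from.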

>
> Problems:
>
> 1) controlling `ssh` under Torque
> 2) providing a partial hostlist to `mpirun`, perhaps by disabling the default tight integration

Enough for me - frankly, this all appears to be caused by a poorly implemented application.
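
That said, if those per-node helpers really have to be started outside of mpirun, at least launch them through Torque's own pbsdsh rather than raw ssh, so they remain under Torque's control. Something along these lines (the helper path is a placeholder; use -n with a node index instead if your pbsdsh build lacks -h):

  # start one helper per chosen node via the TM interface instead of ssh
  pbsdsh -h node01 /path/to/wien2k_stage1_helper &
  pbsdsh -h node02 /path/to/wien2k_stage1_helper &
  wait

Each helper can then call mpirun with its partial machinefile, much as David shows below.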

>
> -- Reuti
>
>
>> A simple example:
>>
>> vayu1:~/MPI > qsub -lncpus=24,vmem=24gb,walltime=10:00 -wd -I
>> qsub: waiting for job 574900.vu-pbs to start
>> qsub: job 574900.vu-pbs ready
>>
>> [dbs900_at_v250 ~/MPI]$ wc -l $PBS_NODEFILE
>> 24
>> [dbs900_at_v250 ~/MPI]$ head -12 $PBS_NODEFILE > m1
>> [dbs900_at_v250 ~/MPI]$ tail -12 $PBS_NODEFILE > m2
>> [dbs900_at_v250 ~/MPI]$ mpirun --machinefile m1 ./a2a143 120000 30 & mpirun --machinefile m2 ./pp143
>>
>>
>> Check how the processes are distributed ...
>>
>> vayu1:~ > qps 574900.vu-pbs
>> Node 0: v250:
>> PID S RSS VSZ %MEM TIME %CPU COMMAND
>> 11420 S 2104 10396 0.0 00:00:00 0.0 -tcsh
>> 11421 S 620 10552 0.0 00:00:00 0.0 pbs_demux
>> 12471 S 2208 49324 0.0 00:00:00 0.9 /apps/openmpi/1.4.3/bin/mpirun --machinefile m1 ./a2a143 120000 30
>> 12472 S 2116 49312 0.0 00:00:00 0.0 /apps/openmpi/1.4.3/bin/mpirun --machinefile m2 ./pp143
>> 12535 R 270160 565668 1.0 00:00:02 82.4 ./a2a143 120000 30
>> 12536 R 270032 565536 1.0 00:00:02 81.4 ./a2a143 120000 30
>> 12537 R 270012 565528 1.0 00:00:02 87.3 ./a2a143 120000 30
>> 12538 R 269992 565532 1.0 00:00:02 93.3 ./a2a143 120000 30
>> 12539 R 269980 565516 1.0 00:00:02 81.4 ./a2a143 120000 30
>> 12540 R 270008 565516 1.0 00:00:02 86.3 ./a2a143 120000 30
>> 12541 R 270008 565516 1.0 00:00:02 96.3 ./a2a143 120000 30
>> 12542 R 272064 567568 1.0 00:00:02 91.3 ./a2a143 120000 30
>> Node 1: v251:
>> PID S RSS VSZ %MEM TIME %CPU COMMAND
>> 10367 S 1872 40648 0.0 00:00:00 0.0 orted -mca ess env -mca orte_ess_jobid 1444413440 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "1444413440.0;tcp://10.1.3.58:37339"
>> 10368 S 1868 40648 0.0 00:00:00 0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
>> 10372 R 271112 567556 1.0 00:00:04 74.5 ./a2a143 120000 30
>> 10373 R 271036 567564 1.0 00:00:04 71.5 ./a2a143 120000 30
>> 10374 R 271032 567560 1.0 00:00:04 66.5 ./a2a143 120000 30
>> 10375 R 273112 569612 1.1 00:00:04 68.5 ./a2a143 120000 30
>> 10378 R 552280 840712 2.2 00:00:04 100 ./pp143
>> 10379 R 552280 840708 2.2 00:00:04 100 ./pp143
>> 10380 R 552328 841576 2.2 00:00:04 100 ./pp143
>> 10381 R 552788 841216 2.2 00:00:04 99.3 ./pp143
>> Node 2: v252:
>> PID S RSS VSZ %MEM TIME %CPU COMMAND
>> 10152 S 1908 40780 0.0 00:00:00 0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
>> 10156 R 552384 840200 2.2 00:00:07 99.3 ./pp143
>> 10157 R 551868 839692 2.2 00:00:06 99.3 ./pp143
>> 10158 R 551400 839184 2.2 00:00:07 100 ./pp143
>> 10159 R 551436 839184 2.2 00:00:06 98.3 ./pp143
>> 10160 R 551760 839692 2.2 00:00:07 100 ./pp143
>> 10161 R 551788 839824 2.2 00:00:07 97.3 ./pp143
>> 10162 R 552256 840332 2.2 00:00:07 100 ./pp143
>> 10163 R 552216 840340 2.2 00:00:07 99.3 ./pp143
>>
>>
>> You would have to do something smarter to get correct process binding etc.
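
Right - the two mpiruns know nothing about each other, so any automatic binding would pile both jobs onto the same cores. With 1.4.x one way to do it by hand is a rankfile per job, e.g. something like (rankfile names are placeholders; see the mpirun man page for the rankfile format):

  mpirun --machinefile m1 -rf rankfile1 ./a2a143 120000 30 &
  mpirun --machinefile m2 -rf rankfile2 ./pp143

where each rankfile pins its job's ranks to a disjoint set of cores.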