Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Reuti (reuti_at_[hidden])
Date: 2012-02-01 05:49:35


On 31.01.2012, at 21:25, Ralph Castain wrote:

>
> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
>
>>
>> On 31.01.2012, at 20:38, Ralph Castain wrote:
>>
>>> Not sure I fully grok this thread, but will try to provide an answer.
>>>
>>> When you start a singleton, it spawns off a daemon that is the equivalent of "mpirun". This daemon is created for the express purpose of allowing the singleton to use MPI dynamics like comm_spawn - without it, the singleton would be unable to execute those functions.
>>>
>>> The first thing the daemon does is read the local allocation, using the same methods as used by mpirun. So whatever allocation is present that mpirun would have read, the daemon will get. This includes hostfiles and SGE allocations.
>>
>> So it should also honor the default hostfile of Open MPI when running outside of SGE, i.e. from the command line?
>
> Yes

BTW: is there any default hostfile for Open MPI - I mean one in my home directory or /etc? When I check `man orte_hosts`, and all possible options are unset (as in a singleton run), it will only run locally (job is co-located with mpirun).
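
(If I understand the man page correctly, the closest thing would be the orte_default_hostfile MCA parameter, which could be put into an MCA parameter file - the lines below are only a hypothetical example with a made-up path, and I'm not sure whether a singleton's daemon would pick it up:)

    # $HOME/.openmpi/mca-params.conf (or <prefix>/etc/openmpi-mca-params.conf)
    # hypothetical example: point Open MPI at a plain hostfile
    # containing one "<host> slots=<n>" line per node
    orte_default_hostfile = /home/reuti/hostfile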

>>> The exception to this is when the singleton gets started in an altered environment - e.g., if SGE changes the environmental variables when launching the singleton process. We see this in some resource managers - you can get an allocation of N nodes, but when you launch a job, the envar in that job only indicates the number of nodes actually running processes in that job. In such a situation, the daemon will see the altered value as its "allocation", potentially causing confusion.
>>
>> Not sure whether I get it right. When I launch the same application with:
>>
>> "mpiexec -np1 ./Mpitest" (and get an allocation of 2+2 on the two machines):
>>
>> 27422 ? Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
>> 9504 ? S 0:00 \_ sge_shepherd-3791 -bg
>> 9506 ? Ss 0:00 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/3791
>> 9507 ? S 0:00 \_ mpiexec -np 1 ./Mpitest
>> 9508 ? R 0:07 \_ ./Mpitest
>> 9509 ? Sl 0:00 \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381 orted -mca
>> 9513 ? S 0:00 \_ /home/reuti/mpitest/Mpitest --child
>>
>> 2861 ? Sl 10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25434 ? Sl 0:00 \_ sge_shepherd-3791 -bg
>> 25436 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/3791.1/1.pc15381
>> 25444 ? S 0:00 \_ orted -mca ess env -mca orte_ess_jobid 821952512 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 25447 ? S 0:01 \_ /home/reuti/mpitest/Mpitest --child
>> 25448 ? S 0:01 \_ /home/reuti/mpitest/Mpitest --child
>>
>> This is what I expect (main + 1 child, other node gets 2 children). Now I launch the singleton instead (nothing changed besides this, still 2+2 granted):
>>
>> "./Mpitest" and get:
>>
>> 27422 ? Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
>> 9546 ? S 0:00 \_ sge_shepherd-3793 -bg
>> 9548 ? Ss 0:00 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/3793
>> 9549 ? R 0:00 \_ ./Mpitest
>> 9550 ? Ss 0:00 \_ orted --hnp --set-sid --report-uri 6 --singleton-died-pipe 7
>> 9551 ? Sl 0:00 \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381 orted
>> 9554 ? S 0:00 \_ /home/reuti/mpitest/Mpitest --child
>> 9555 ? S 0:00 \_ /home/reuti/mpitest/Mpitest --child
>>
>> 2861 ? Sl 10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25494 ? Sl 0:00 \_ sge_shepherd-3793 -bg
>> 25495 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/3793.1/1.pc15381
>> 25502 ? S 0:00 \_ orted -mca ess env -mca orte_ess_jobid 814940160 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 25503 ? S 0:00 \_ /home/reuti/mpitest/Mpitest --child
>>
>> Only one child goes to the other node, while two stay on the singleton's node. The environment is the same in both cases. Is this the correct behavior?
>
>
> We probably aren't correctly marking the original singleton on that node, and so the mapper thinks there are still two slots available on the original node.

Okay, so there is something to discuss/fix here. BTW: if started as a singleton, I get an error at the end with the program the OP provided:

[pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline [[12435,0],0] lost

This error does not appear if the program is run by mpiexec.
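
For reference, the kind of test at hand is along the lines of the sketch below - my own minimal reconstruction, not the OP's actual code (the "--child" flag and the count of 3 spawned children are just taken from the process listings above):

/* Minimal reconstruction of the spawn test (not the OP's code):
 * the parent spawns 3 copies of itself with "--child" via
 * MPI_Comm_spawn_multiple; the children just connect back and exit. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, children;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (argc > 1 && strcmp(argv[1], "--child") == 0) {
        /* child: an intercommunicator to the parent already exists */
        printf("child started\n");
        if (parent != MPI_COMM_NULL)
            MPI_Comm_disconnect(&parent);
    } else {
        /* parent: spawn 3 children of the same binary with --child */
        char *cmds[1]      = { argv[0] };
        char *child_args[] = { "--child", NULL };
        char **argvs[1]    = { child_args };
        int maxprocs[1]    = { 3 };
        MPI_Info infos[1]  = { MPI_INFO_NULL };

        MPI_Comm_spawn_multiple(1, cmds, argvs, maxprocs, infos, 0,
                                MPI_COMM_SELF, &children,
                                MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&children);
    }

    MPI_Finalize();
    return 0;
}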

-- Reuti