Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Tom Bryan (tombry_at_[hidden])
Date: 2012-02-03 18:15:26


OK. Sorry for the delay. I needed to read through this thread a few times
and try some experiments. Let me reply to a few of these pieces, and then
I'll talk about those experiments.

On 1/31/12 9:26 AM, "Reuti" <reuti_at_[hidden]> wrote:

>>> I never used spawn_multiple, but isn't it necessary to start it with mpiexec
>>> too and call MPI_Init?
>>>
>>> $ mpiexec ./mpitest -np 1
>>
>> I don't think so.
>
> In the book "Using MPI-2" by William Gropp et al. they use it in chapter
> 7.2.2/page 235 this way, although it's indeed stated in the MPI-2.2 standard
> on page 329 to create a singleton MPI environment if the application could
> find the necessary information (i.e. wasn't started by mpiexec).
>
> Maybe it's a side effect of a tight integration that it would start on the
> correct nodes (but I face an incorrect allocation of slots and an error
> message at the end if started without mpiexec), as in this case it has no
> command line option for the hostfile. How to get the requested nodes if
> started from the command line?

OK. I misunderstood you. I thought that you were saying that spawn_multiple
had to call mpiexec for each spawned process. If you just meant that mpi.sh
should launch the initial process with mpiexec, that seems reasonable. I
tried it with and without, and I definitely get better results when using
mpiexec.

>> In any case, when I restrict the SGE grid to run all of
>> my orte parallel environment jobs on one machine, the application runs fine.
>> I only have problems if one or more of the spawned children gets scheduled
>> to another node.

>>> to override the detected slots by the tight integration into SGE. Otherwise
>>> it might be running only as a serial one. The additional 4 spawned
>>> processes can then be added inside your application.
>>>
>>> The line to initialize MPI:
>>>
>>> if( MPI::Init( MPI::THREAD_MULTIPLE ) != MPI::THREAD_MULTIPLE )
>>> ...
>>>
>>> I replaced the complete if... by a plain MPI::Init(); and get a suitable
>>> output (see attached, qsub -pe openmpi 4 and changed _nProc to 3) in a tight
>>> integration into SGE.
>
> Okay, typo - the _thread is missing.

I have not tried that change, yet.

If I need MPI_THREAD_MULTIPLE, and Open MPI is compiled with thread support,
it's not clear to me whether MPI::Init_thread() and
MPI::Init_thread(MPI::THREAD_MULTIPLE) would give me the same behavior from
Open MPI.

>>> NB: What is MPI::Init( MPI::THREAD_MULTIPLE ) supposed to do, output a
>>> feature of MPI?

From the man page:
MPI_Init_thread, as compared to MPI_Init, has a provision to request a
certain level of thread support in required....The level of thread support
available to the program is set in provided, except in C++, where it is the
return value of the function.

> For me it's not hanging. Did you try the alternative startup using mpiexec?
> Aha - BTW: I use 1.4.4

Right, I'm on 1.5.4.

Yes, I did try starting with mpiexec. That helps, but I still don't know
whether I understand all of the results.

For each experiment, I've attached the output of
qstat -f
qstat -g t
pstree -Aalp <pid of sge_execd>
output of mpitest parent and children (mpi.sh.o<job>)

I ran each test with two different SGE queue configurations. In one case,
the queue with the orte pe is set to include all 5 exec hosts in my grid.
In the "single" case, the queue with the orte pe is set to use only a single
host. (The queue configuration isn't shown here, but I changed the queue's
hostlist to use either a single host or a host group that includes all of
my machines.)

I run qsub on node 17. The grid machines available for this run are 3, 4,
10, 11, and 16.

Some observations:

1. I'm still surprised that the SGE behavior is so different when I
configure my SGE queue differently. See test "a" in the .tgz. When I just
run mpitest in mpi.sh and ask for exactly 5 slots (-pe orte 5-5), it works
if the queue is configured to use a single host. I see 1 MASTER and 4
SLAVES in qstat -g t, and I get the correct output. If the queue is set to
use multiple hosts, the jobs hang in spawn/init, and I get errors
[grid-03.cisco.com][[19159,2],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)
[grid-10.cisco.com:05327] [[19159,0],3] routed:binomial: Connection to lifeline [[19159,0],0] lost
[grid-16.cisco.com:25196] [[19159,0],1] routed:binomial: Connection to lifeline [[19159,0],0] lost
[grid-11.cisco.com:63890] [[19159,0],2] routed:binomial: Connection to lifeline [[19159,0],0] lost
So, I'll just assume that mpiexec does some magic that is needed in the
multi-machine scenario but not in the single machine scenario.

2. I guess I'm not sure how SGE is supposed to behave. Experiments "a" and
"b" were identical except that I changed -pe orte 5-5 to -pe orte 5-. The
single case works like before, and the multiple exec host case fails as
before. The difference is that qstat -g t shows additional SLAVEs that
don't seem to correspond to any jobs on the exec hosts. Are these SLAVEs
just slots that are reserved for my job but that I'm not using? If my job
will only use 5 slots, then I should set the SGE qsub job to ask for exactly
5 with "-pe orte 5-5", right?
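
One way to see what SGE actually granted, from inside the job itself, is to
read the PE hostfile. The following is only a sketch: it assumes SGE exports
PE_HOSTFILE for parallel jobs and that each line has the form
"<hostname> <slots> <queue instance> <processor range>". The function name
summarize_slots is made up for illustration and is not part of mpi.sh.

```sh
#!/bin/sh
# Sketch: summarize the slots SGE granted to this job, per host.
# Assumption: each line of the PE hostfile looks like
#   <hostname> <slots> <queue instance> <processor range>

summarize_slots() {
    # $1: path to a PE hostfile
    # Prints "<hostname> <slots>" per line, then a final "total <n>" line.
    awk '{ printf "%s %d\n", $1, $2; total += $2 }
         END { printf "total %d\n", total }' "$1"
}

# Inside a parallel job SGE is expected to set PE_HOSTFILE;
# outside of one, this does nothing.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    summarize_slots "$PE_HOSTFILE"
fi
```

Placed near the top of the job script, this would print one line per exec
host plus a total, which can be compared against the SLAVE entries that
qstat -g t reports for "-pe orte 5-" versus "-pe orte 5-5".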

3. Experiment "d" was similar to "b", but mpi.sh uses "mpiexec -np 1
mpitest" instead of running mpitest directly. Now both the single machine
queue and multiple machine queue work. So, mpiexec seems to make my
multi-machine configuration happier. In this case, I'm still using "-pe
orte 5-", and I'm still seeing the extra SLAVE slots granted in qstat -g t.

4. Based on "d", I thought that I could follow the approach in "a". That
is, for experiment "e", I used mpiexec -np 1, but I also used -pe orte 5-5.
I thought that this would make the multi-machine queue reserve only the 5
slots that I needed. The single machine queue works correctly, but now the
multi-machine case hangs with no errors. The output from qstat and pstree
is what I'd expect, but it seems to hang in Spawn_multiple and Init_thread.
I really expected this to work.

I'm really confused by experiment "e" with multiple machines in the queue.
Based on "a" and "d", I thought that a combination of mpiexec -np 1 would
permit the multi-machine scheduling to work with MPI while the "-pe orte
5-5" would limit the slots to exactly the number that it needed to run.

---Tom