Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Reuti (reuti_at_[hidden])
Date: 2012-02-03 18:55:46

On 04.02.2012, at 00:15, Tom Bryan wrote:

A more detailed answer later, as it's late here. But one short note:

-pe orte 5 => give me exactly 5 slots

-pe orte 5-5 => the same

-pe orte 5- => give me at least 5 slots, up to the maximum you can get right now in the cluster
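
For example (a minimal sketch; "job.sh" is just a placeholder for your job script, and "orte" is the PE name used throughout this thread):

    qsub -pe orte 5 job.sh      # exactly 5 slots
    qsub -pe orte 5-5 job.sh    # the same: lower and upper bound are both 5
    qsub -pe orte 5- job.sh     # at least 5 slots, more if currently free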

The output of `qstat -g t` (master/slave) only tells you what is granted, not what is necessarily used by you right now. It's up to the application to use the granted slots.
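
Inside the job script you can inspect what was actually granted by looking at the hosts file SGE prepares for the job (a minimal sketch):

    # one line per granted host: hostname, number of slots, queue, processor range
    cat $PE_HOSTFILE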


Requesting exactly 5 slots will show you either "one master and four slaves" or "one master and five slaves". This depends on the setting of "job_is_first_task" in the definition of the PE.

The rationale behind this is that it adjusts the number of `qrsh -inherit` calls that are allowed (just imagine single-core machines to understand the idea behind it). In a plain MPI application "job_is_first_task" is usually set to yes, as the executable started on the machine where `mpiexec` is issued in the job script is also doing some work (usually rank 0). This results in 4 `qrsh -inherit` calls being allowed, for a total of 5.

If your rank 0 is for any reason only collecting results and not doing any work (i.e. a master/slave application like in PVM), you would want to set "job_is_first_task no". This has the effect that one additional `qrsh -inherit` is allowed - in detail: one local call plus 4 to other nodes, to start 5 slaves.
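
For reference, such a PE definition (as shown by `qconf -sp orte`) might look like the following sketch - the slot limit and allocation rule here are only illustrative; the entries relevant to the above are "control_slaves" (needed for the tight integration via `qrsh -inherit`) and "job_is_first_task":

    pe_name            orte
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

A single entry can be changed later with `qconf -mp orte`.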

Nowadays, when you have many cores per node and may even use only one `qrsh -inherit` per slave machine and then fork or use threads for the additional processes, this setting is less meaningful and would need some new options in the PE.

-- Reuti

> 1. I'm still surprised that the SGE behavior is so different when I
> configure my SGE queue differently. See test "a" in the .tgz. When I just
> run mpitest directly and ask for exactly 5 slots (-pe orte 5-5), it works
> if the queue is configured to use a single host. I see 1 MASTER and 4
> SLAVES in qstat -g t, and I get the correct output. If the queue is set to
> use multiple hosts, the jobs hang in spawn/init, and I get errors
> [][[19159,2],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to failed: Connection refused (111)
> [] [[19159,0],3] routed:binomial: Connection to lifeline [[19159,0],0] lost
> [] [[19159,0],1] routed:binomial: Connection to lifeline [[19159,0],0] lost
> [] [[19159,0],2] routed:binomial: Connection to lifeline [[19159,0],0] lost
> So, I'll just assume that mpiexec does some magic that is needed in the
> multi-machine scenario but not in the single machine scenario.
> 2. I guess I'm not sure how SGE is supposed to behave. Experiment "a" and
> "b" were identical except that I changed -pe orte 5-5 to -pe orte 5-. The
> single case works like before, and the multiple exec host case fails as
> before. The difference is that qstat -g t shows additional SLAVEs that
> don't seem to correspond to any jobs on the exec hosts. Are these SLAVEs
> just slots that are reserved for my job but that I'm not using? If my job
> will only use 5 slots, then I should set the SGE qsub job to ask for exactly
> 5 with "-pe orte 5-5", right?
> 3. Experiment "d" was similar to "b", but it uses "mpiexec -np 1
> mpitest" instead of running mpitest directly. Now both the single machine
> queue and multiple machine queue work. So, mpiexec seems to make my
> multi-machine configuration happier. In this case, I'm still using "-pe
> orte 5-", and I'm still seeing the extra SLAVE slots granted in qstat -g t.
> 4. Based on "d", I thought that I could follow the approach in "a". That
> is, for experiment "e", I used mpiexec -np 1, but I also used -pe orte 5-5.
> I thought that this would make the multi-machine queue reserve only the 5
> slots that I needed. The single machine queue works correctly, but now the
> multi-machine case hangs with no errors. The output from qstat and pstree
> are what I'd expect, but it seems to hang in Spawn_multiple and Init_thread.
> I really expected this to work.
> I'm really confused by experiment "e" with multiple machines in the queue.
> Based on "a" and "d", I thought that a combination of mpiexec -np 1 would
> permit the multi-machine scheduling to work with MPI while the "-pe orte
> 5-5" would limit the slots to exactly the number that it needed to run.
> ---Tom
> <mpiExperiments.tgz>