Open MPI Development Mailing List Archives

From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-06-25 09:52:52


Bogdan Costescu wrote:
> On Mon, 25 Jun 2007, sadfub_at_[hidden] wrote:
>
>> I *assume* loose coupled jobs
>
> Hmm, given Sun's supposed involvement in this project, I'm really
> surprised that there is nobody from Sun to explain this.

So do I! :) I had my hands wrapped up from playing too much
basketball, so I couldn't really type for a few days.

>
> I don't use SGE anymore, but some years ago when I did, I worked on
> the integration of LAM/MPI; here's what I remember:
>
> - loose integration: the batch job is given a file which contains the
> list of nodes and number of slots (=processes that can be run on
> each node). The scheduler knows that the resources are occupied until
> the batch job finishes. SGE has no involvement in starting the
> processes on remote nodes; the job should do everything by itself
> (e.g. by using rsh/ssh). The end of the batch script or maybe an
> early termination (e.g. for exceeded runtime) tells SGE that the job
> has ended and there is no effort from SGE to finish processes
> launched on remote nodes. Removing a running job means that signals
> are sent only to the process on the main node of the job; the job
> should take care by itself of propagating signals or cleaning up on
> remote nodes.
>
> - tight integration: the batch job is given the same nodes file, but
> SGE expects the job to use SGE's own launch mechanism, which is
> based on NetBSD's rsh [1]. The SGE daemons on remote nodes then know
> about the processes that belong to the job and there is an SGE rsh
> connection allowed for each slot allocated to the job on that node.
> Upon termination of the job, SGE tries to kill all processes that
> belong to the job on all allocated nodes. To track the processes
> that belong to a job on a node, the daemon uses a pool of group IDs
> that are normally not used and then sets an additional group ID
> (setgroups(2)) on the launched process(es) - this call is
> available only to 'root', so there is no way for user processes to
> escape (like creating a separate process group, etc.) and upon
> termination of the job all processes (including spawned ones) that
> are marked with the job-specific group are killed.
>
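
To make the loose-integration case above more concrete, here is a minimal
sketch of a job doing its own remote startup. It is illustration only (not
anything that ships with SGE, LAM/MPI or Open MPI) and assumes the usual
$PE_HOSTFILE layout where each line starts with "hostname slots"; the
command to run is made up.

/* loose.c - sketch of a loosely integrated job doing its own remote
 * startup: read the hosts file SGE provides and start one process per
 * slot over ssh.  SGE never sees these remote processes, so it cannot
 * account for them or clean them up - that is the job's problem. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *hostfile = getenv("PE_HOSTFILE");   /* set by SGE for PE jobs */
    const char *cmd = (argc > 1) ? argv[1] : "hostname";
    if (hostfile == NULL) {
        fprintf(stderr, "PE_HOSTFILE not set\n");
        return 1;
    }

    FILE *fp = fopen(hostfile, "r");
    if (fp == NULL) { perror("fopen"); return 1; }

    char line[1024];
    while (fgets(line, sizeof(line), fp) != NULL) {
        char host[256];
        int slots = 0;
        if (sscanf(line, "%255s %d", host, &slots) != 2)
            continue;                               /* skip malformed lines */
        for (int i = 0; i < slots; i++) {
            if (fork() == 0) {
                execlp("ssh", "ssh", host, cmd, (char *)NULL);
                perror("execlp");                   /* only reached on failure */
                _exit(127);
            }
        }
    }
    fclose(fp);

    /* wait for all remote launches; signal propagation and cleanup on
     * the remote nodes are also entirely this program's job, not SGE's */
    while (wait(NULL) > 0)
        ;
    return 0;
}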

I've come across DanT's blog post, which gives a good explanation of the
difference between the two:
http://blogs.sun.com/templedf/entry/pe_tight_integration
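
The additional-group-ID trick described above for tight integration boils
down to something like the sketch below. It is only meant to show the
mechanism; the GID and user name are made up, and this is not SGE's actual
shepherd code.

/* tag.c - sketch of how a root launcher can tag a job with a spare
 * supplementary group ID before dropping privileges.  The extra GID is
 * inherited across fork/exec and cannot be dropped by an unprivileged
 * process, so every descendant can later be found and killed. */
#include <sys/types.h>
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    gid_t tag_gid = 20001;              /* made-up GID from a reserved pool */
    const char *user = "jobowner";      /* made-up job owner */

    struct passwd *pw = getpwnam(user);
    if (pw == NULL) { fprintf(stderr, "no such user\n"); return 1; }

    /* needs root: attach the tag GID as an additional group */
    gid_t groups[2] = { pw->pw_gid, tag_gid };
    if (setgroups(2, groups) != 0) { perror("setgroups"); return 1; }

    /* drop to the job owner's IDs; the supplementary tag GID sticks */
    if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
        perror("setgid/setuid");
        return 1;
    }

    /* run the job; cleanup later scans for processes carrying tag_gid */
    execl("/bin/sh", "sh", "-c", "sleep 60", (char *)NULL);
    perror("execl");
    return 1;
}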

> [1] There is currently some effort on integrating ssh as well, the
> problem being that the ssh daemon needs some modifications to allow
> SGE to obtain accounting information. There was also some talk about a
> TM-like API; unfortunately the progress in this area seems to be very
> slow, if there is any at all...
>

We are still pushing for the TM-like API or DRMAA for SGE, since the
current rsh mechanism's use of privileged sockets limits the number of
nodes we can launch on at one time. A workaround is to use the SSH
integration, but that is only available when you build SGE 6.1 from
source, not from the binaries.
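
For reference, the submission side of the DRMAA 1.0 C binding (drmaa.h)
looks roughly like the sketch below. This is only to illustrate the kind
of API we would prefer over raw rsh sockets, not how Open MPI would
actually plug into SGE.

/* drmaa_submit.c - rough sketch: submit one task through DRMAA */
#include <stdio.h>
#include <drmaa.h>

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    drmaa_job_template_t *jt = NULL;

    if (drmaa_init(NULL, err, sizeof(err)) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init: %s\n", err);
        return 1;
    }
    drmaa_allocate_job_template(&jt, err, sizeof(err));
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/hostname",
                        err, sizeof(err));
    if (drmaa_run_job(jobid, sizeof(jobid), jt, err, sizeof(err))
            == DRMAA_ERRNO_SUCCESS) {
        printf("submitted job %s\n", jobid);
    } else {
        fprintf(stderr, "drmaa_run_job: %s\n", err);
    }
    drmaa_delete_job_template(jt, err, sizeof(err));
    drmaa_exit(err, sizeof(err));
    return 0;
}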

I've heard that the SGE team is working on making rsh/rshd dispensable,
so the rshd/sshd modification will not be needed. Without leaking too
many details (since I don't believe they have announced it yet), this
should help speed up the start time and also solve the privileged socket
limitation for launching parallel jobs. It will be in the upcoming
release.

-- 
- Pak Lui
pak.lui_at_[hidden]