Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-11-16 09:40:21


Hi Reuti

> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE
> > does this automatically to constrain the procs to running on only those
> > cores.
>
> This is another "bug/feature" in SGE: it's a matter of discussion, whether
> the shepherd should get exactly one core (in case you use more than one
> `qrsh` per node) for each call, or *all* cores assigned (which we need right
> now, as the processes in Open MPI will be forks of the orte daemon). I filed
> an issue about this situation a long time ago; a
> "limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do
> (this setting should then also change the core allocation of the master
> process):
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

I believe this is indeed the crux of the issue.

>
>
>
> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each
> > node, but to bind each proc to all of them (i.e., don't bind a proc to a
> > specific core). I'm pretty sure that is a standard SGE option today (at
> > least, I know it used to be). I don't believe any patch or devel work is
> > required (to either SGE or OMPI).
>
> When you use a fixed allocation_rule and a matching -binding request it
> will work today. But any other case won't be distributed in the correct way.
>

Is it possible to omit the -binding request? If SGE is told to use a
fixed allocation_rule, and to allocate (for example) 2 cores/node, then
won't the orted see itself bound to two specific cores on each node? We
would then be okay, as the spawned children of the orted would inherit its
binding. Just don't tell mpirun to bind the processes, and the threads of
those MPI procs will be able to operate across the provided cores.

Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no
-binding given), without binding the orted to any two specific cores? If
so, then that would be a problem, as the orted would think itself
unconstrained. If I understand the thread correctly, you're saying that this
is what happens today - true?
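
One way to check which of the two cases applies on a given node is to look at
the affinity mask a process actually sees, and at what a child launched from
it inherits. The following is only a rough sketch, assuming Linux and Python 3;
the function names are made up for illustration and nothing here is part of
SGE or Open MPI. It would be run from inside the job (e.g. via qrsh) rather
than interactively.

```python
import json
import os
import subprocess
import sys


def allowed_cores():
    """Cores this process may run on (os.sched_getaffinity is Linux-only)."""
    return sorted(os.sched_getaffinity(0))


def child_allowed_cores():
    """Launch a child process and return the affinity mask it reports."""
    child_code = (
        "import json, os; "
        "print(json.dumps(sorted(os.sched_getaffinity(0))))"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def looks_unconstrained():
    """True if the mask covers every CPU the OS reports (no binding visible)."""
    return len(allowed_cores()) >= (os.cpu_count() or 1)


if __name__ == "__main__":
    print("parent cores: ", allowed_cores())
    print("child cores:  ", child_allowed_cores())
    print("unconstrained:", looks_unconstrained())
```

If the parent and child lists match and cover only two cores, the launched
process was bound and its children inherited that binding. If the mask covers
the whole node, the process would consider itself unconstrained.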

>
> -- Reuti
>
>
> >
> >
> > On Tue, Nov 16, 2010 at 4:07 AM, Reuti <reuti_at_[hidden]> wrote:
> > On 16.11.2010 at 10:26, Chris Jewell wrote:
> >
> > > Hi all,
> > >
> > >> On 11/15/2010 02:11 PM, Reuti wrote:
> > >>> Just to give my understanding of the problem:
> > >>>>
> > >>>>>> Sorry, I am still trying to grok all your email as to what problem
> > >>>>>> you are trying to solve. So is the issue trying to have two jobs
> > >>>>>> with processes on the same node be able to bind their processes to
> > >>>>>> different resources? Like core 1 for the first job and cores 2 and
> > >>>>>> 3 for the 2nd job?
> > >>>>>>
> > >>>>>> --td
> > >> You can't get 2 slots on a machine, as it's limited by the core count
> > >> to one here, so such a slot allocation shouldn't occur at all.
> > >
> > > So to clarify, the current -binding <binding_strategy>:<binding_amount>
> > > allocates binding_amount cores to each sge_shepherd process associated
> > > with a job_id. There appears to be only one sge_shepherd process per
> > > job_id per execution node, so all child processes run on these allocated
> > > cores. This is irrespective of the number of slots allocated to the node.
> > >
> > > I agree with Reuti that the binding_amount parameter should be a
> > > maximum number of bound cores per node, with the actual number
> > > determined by the number of slots allocated per node. FWIW, an
> > > alternative approach might be to have another binding_type ('slot', say)
> > > that automatically allocated one core per slot.
> > >
> > > Of course, a complex situation might arise if a user submits a combined
> > > MPI/multithreaded job, but then I guess we're into the realm of setting
> > > allocation_rule.
> >
> > IIRC there was a discussion on the [GE users] list about it, to get a
> > uniform distribution on all slave nodes for such jobs, as also e.g.
> > $OMP_NUM_THREADS will be set to the same value for all slave nodes for
> > hybrid jobs. Otherwise it would be necessary to adjust SGE to set this
> > value in the "-builtin-" startup method automatically on all nodes to the
> > local granted slots value. For now a fixed allocation rule of 1, 2, 4 or
> > whatever must be used, and you have to submit by requesting a wildcard PE
> > to get any of these defined PEs for an even distribution, when you don't
> > care whether it's two times two slots, one time four slots, or four times
> > one slot.
> >
> > In my understanding, any type of parallel job should always request and
> > get a total number of slots equal to the cores it needs to execute,
> > independent of whether these are threads, forks or any hybrid type of
> > job. Otherwise any resource planning and reservation will most likely
> > fail. Nevertheless, there might exist rare cases where you submit an
> > exclusive serial job but create threads/forks in the end. But such a
> > setup should be an exception, not the default.
> >
> >
> > > Is it going to be worth looking at creating a patch for this?
> >
> > Absolutely.
> >
> >
> > > I don't know much of the internals of SGE -- would it be hard work to
> > > do? I've not that much time to dedicate towards it, but I could put some
> > > effort in if necessary...
> >
> > I don't know about the exact coding for it, but if it's for now a plain
> > "copy" of the binding list, then it should become a loop that creates a
> > list of cores from the original specification until all granted slots
> > have a core allocated.
> >
> > -- Reuti
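
To make sure I follow the loop being proposed, here is how I read it, written
as a toy sketch in Python. This is purely illustrative and is not SGE code:
the function name, the list-of-core-ids representation, and treating
binding_amount as an optional upper bound (per Chris's suggestion) are all
assumptions on my part.

```python
def cores_for_slots(free_cores, granted_slots, binding_amount=None):
    """Pick one core per granted slot from the cores named in the request.

    free_cores      -- core ids available from the original binding spec
    granted_slots   -- slots the scheduler granted on this node
    binding_amount  -- optional cap on bound cores per node (hypothetical)
    """
    wanted = granted_slots
    if binding_amount is not None:
        wanted = min(wanted, binding_amount)

    chosen = []
    for core in free_cores:
        if len(chosen) == wanted:
            break
        chosen.append(core)

    if len(chosen) < wanted:
        raise ValueError("not enough free cores for the granted slots")
    return chosen


if __name__ == "__main__":
    # Two slots granted on a four-core node: bind to the first two cores.
    print(cores_for_slots([0, 1, 2, 3], granted_slots=2))
```

With a plain copy of the binding list, every node would get the same number of
cores regardless of granted_slots; the loop above is what ties the two
together.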
> >
> >
> > >
> > > Chris
> > >
> > >
> > > --
> > > Dr Chris Jewell
> > > Department of Statistics
> > > University of Warwick
> > > Coventry
> > > CV4 7AL
> > > UK
> > > Tel: +44 (0)24 7615 0778
> > >
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users