Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-17 11:54:34

On Fri, Feb 17, 2012 at 8:47 AM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:

> On 17/02/2012 14:59, Jeff Squyres wrote:
> > On Feb 17, 2012, at 8:21 AM, Ralph Castain wrote:
> >
> >>> I didn't follow this entire thread in detail, but I have the feeling
> >>> that something is wrong here. The flag does fix your problem, but I
> >>> think it may break binding too. It basically makes all "unavailable
> >>> resources" available, so the binding code may end up trying to bind
> >>> processes to cores that it can't actually use.
> > I'm not sure I follow here -- which binding code are you referring to;
> > that in hwloc, or that in OMPI?
> > My understanding of what we should be doing is to compare the output
> > bitmask from hwloc_get_cpubind() with the allowed_cpuset on the
> > HWLOC_OBJ_MACHINE. If where we are bound is a strict subset of the
> > allowed cpuset, then the process is bound.
> >
> > Is that correct?
> Yes.
> I didn't know you already used allowed_cpuset instead of cpuset, good to
> know.
> > And per Ralph's question, the allowed_cpuset of HWLOC_OBJ_MACHINE will
> > still be accurate even if we do WHOLE_SYSTEM, right?
> Yes.
> > E.g., if some external agent creates a Linux cpuset for a process,
> > then even if we specify WHOLE_SYSTEM, the allowed_cpuset on OBJ_MACHINE
> > will still accurately reflect which PUs are in the Linux cpuset where
> > this process is running.
> Yes.
> But you're talking about "am I bound?" here. My concern was "how does
> OMPI bind processes?".
> If WHOLE_SYSTEM is passed, you may get more objects in your topology
> (most objects with allowed_cpuset=0 are removed when the flag is not
> set). So things like get_nbobjs_by_type() return larger values when you
> pass the flag, and you can't rely on those values when distributing the
> processes among the available cores, for instance. Does the OMPI binding
> code handle this?

Yes, we do - because we also allow a user to specify a restricted cpuset
for us to use, I automatically filter all cpusets at the beginning of time
to create an "available" set for our internal use. This is the set I scan
when looking at the number of objects available to us.
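
To make the filtering idea concrete, here is a minimal sketch in C. It is
not the actual OMPI code: cpusets are modeled as plain 64-bit masks rather
than hwloc bitmaps, and the names cpuset_t, make_available_set() and
count_available_objects() are hypothetical, chosen only for illustration.
The point is that objects are counted against the filtered "available" set
and never against the raw topology.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a cpuset: bit i set means PU i is in the set.
 * Assumption for this sketch only; real code would use hwloc bitmaps. */
typedef uint64_t cpuset_t;

/* Build the "available" set: the PUs we are allowed to use, further
 * narrowed by a user-specified restriction when one was given. */
static cpuset_t make_available_set(cpuset_t allowed, cpuset_t user_restrict,
                                   int have_user_restrict)
{
    return have_user_restrict ? (allowed & user_restrict) : allowed;
}

/* Count the objects (e.g. cores) that have at least one PU in the
 * available set. Objects entirely outside it are skipped, so extra
 * objects exposed by a wider topology view do not inflate the count. */
static size_t count_available_objects(const cpuset_t *obj_cpusets,
                                      size_t nobjs, cpuset_t available)
{
    size_t count = 0;

    assert(obj_cpusets != NULL || nobjs == 0);
    for (size_t i = 0; i < nobjs; i++) {
        if ((obj_cpusets[i] & available) != 0) {
            count++;
        }
    }
    return count;
}
```

With four two-PU cores and only PUs 0-3 allowed, this counts two cores
regardless of how many cores the topology itself reports.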

Of course, if a developer doesn't use our internal utilities to get those
numbers, they could do something wrong. :-)
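
For reference, the "am I bound?" test Jeff describes in the quoted text can
be sketched the same way. This is an illustration under stated assumptions,
not the btl/openib logic: masks stand in for hwloc bitmaps, is_bound() is a
hypothetical name, and "less than the allowed cpuset" is read as "a
non-empty strict subset of it". In real code the two inputs would come from
hwloc_get_cpubind() and the machine object's allowed cpuset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a cpuset: bit i set means PU i is in the set.
 * Assumption for this sketch only; real code would use hwloc bitmaps. */
typedef uint64_t cpuset_t;

/* A process counts as bound when its current binding is non-empty,
 * lies entirely inside the allowed set, and is smaller than it.
 * A binding equal to the allowed set means "free to run anywhere we
 * are permitted", which is treated as not bound. */
static bool is_bound(cpuset_t binding, cpuset_t allowed)
{
    bool included = (binding & ~allowed) == 0;

    return binding != 0 && included && binding != allowed;
}
```

Note that this compares against the allowed set, not the full machine
cpuset, which is why the result stays meaningful when an external agent
such as a resource manager has already confined the process.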

All that said, I think using the WHOLE_SYSTEM flag is actually incorrect.
We don't need to do that as the problem Nadia identified is better solved
by correcting the current logic. I'm working on that now - unfortunately,
the only slurm machine I can access doesn't have slurm's affinity module

> Brice