On Fri, Feb 17, 2012 at 8:47 AM, Brice Goglin <Brice.Goglin@inria.fr> wrote:
On 17/02/2012 14:59, Jeff Squyres wrote:
> On Feb 17, 2012, at 8:21 AM, Ralph Castain wrote:
>
>>> I didn't follow this entire thread in detail, but I feel that something is wrong here. The flag does fix your problem, but I think it may break binding too. It basically makes all "unavailable resources" available, so the binding code may end up trying to bind processes to cores that it can't actually use.
> I'm not sure I follow here -- which binding code are you referring to; that in hwloc, or that in OMPI?
> My understanding of what we should be doing is to compare the output bitmask from hwloc_get_cpubind() with the allowed_cpuset on the HWLOC_OBJ_MACHINE. If the set we are bound to is a proper subset of the allowed cpuset, then the process is bound.
>
> Is that correct?

Yes.
I didn't know you already used allowed_cpuset instead of cpuset, good to
know.
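
The "am I bound?" test discussed above can be sketched as a set comparison. This is only a minimal model, not hwloc or OMPI code: cpusets are represented as plain Python sets of PU indexes, and the function name is_bound is hypothetical. In the real C code the two inputs would be the bitmask filled in by hwloc_get_cpubind() and the allowed_cpuset of the HWLOC_OBJ_MACHINE object. Treating an empty binding as "not bound" is an assumption of this sketch.

```python
def is_bound(current_binding, allowed_cpuset):
    """Model of the check: bound means restricted to fewer PUs than allowed.

    Both arguments are iterables of PU indexes standing in for hwloc bitmaps.
    """
    current = set(current_binding)
    allowed = set(allowed_cpuset)
    # Assumption of this sketch: an empty binding tells us nothing,
    # so report it as "not bound".
    if not current:
        return False
    # Proper subset: every PU we run on is allowed, and at least one
    # allowed PU is excluded from the binding.
    return current < allowed


# A process confined to PUs 0-1 on a machine that allows PUs 0-3.
bound = is_bound({0, 1}, {0, 1, 2, 3})
# A process whose binding equals the allowed set is not considered bound.
unbound = is_bound({0, 1, 2, 3}, {0, 1, 2, 3})
```

Because the comparison is against allowed_cpuset rather than cpuset, the result does not change when the topology is loaded with WHOLE_SYSTEM, which is the point confirmed in the replies here.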

> And per Ralph's question, the allowed_cpuset of HWLOC_OBJ_MACHINE will still be accurate even if we do WHOLE_SYSTEM, right?

Yes.

>   E.g., if some external agent creates a Linux cpuset for a process, then even if we specify WHOLE_SYSTEM, the allowed_cpuset on OBJ_MACHINE will still accurately reflect which PUs are in the Linux cpuset where this process is running.

Yes.


But you're talking about "am I bound?" here. My concern was "how does
OMPI bind processes?".
If WHOLE_SYSTEM is passed, you may get more objects in your topology
(most objects with allowed_cpuset=0 are removed when the flag is not
set). So things like get_nbobjs_by_type() return larger values when you
pass the flag. So you can't rely on those values when distributing the
processes among the available cores for instance. Does the OMPI binding
code handle this?

Yes, we do - because we also allow a user to specify a restricted cpuset for us to use, I automatically filter all cpusets at the beginning of time to create an "available" set for our internal use. This is the set I scan when looking at the number of objects available to us.

Of course, if a developer doesn't use our internal utilities to get those numbers, they could do something wrong. :-)
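
The filtering described above, and why raw object counts are misleading under WHOLE_SYSTEM, can be illustrated with a small model. This is a sketch under stated assumptions, not the actual OMPI utilities: the Obj class and the available_set, count_usable and count_all helpers are all hypothetical, and cpusets are modeled as Python frozensets of PU indexes rather than hwloc bitmaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    kind: str          # hypothetical label, e.g. "core"
    cpuset: frozenset  # PU indexes covered by this object


def available_set(allowed_cpuset, user_cpuset=None):
    """Intersect the allowed PUs with an optional user-specified restriction."""
    available = frozenset(allowed_cpuset)
    if user_cpuset is not None:
        available &= frozenset(user_cpuset)
    return available


def count_usable(objects, kind, available):
    """Count objects of a kind that contain at least one available PU."""
    return sum(1 for o in objects if o.kind == kind and o.cpuset & available)


def count_all(objects, kind):
    """Raw count, comparable to what a WHOLE_SYSTEM topology would report."""
    return sum(1 for o in objects if o.kind == kind)


# Four cores with two PUs each; only PUs 0-3 are allowed for this process.
cores = [Obj("core", frozenset({2 * i, 2 * i + 1})) for i in range(4)]
available = available_set({0, 1, 2, 3})
raw = count_all(cores, "core")
usable = count_usable(cores, "core", available)
```

In this model the raw count covers all four cores while only two are usable, so a mapper that distributed processes by the raw count would target cores it cannot bind to.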

All that said, I think using the WHOLE_SYSTEM flag is actually incorrect. We don't need to do that as the problem Nadia identified is better solved by correcting the current logic. I'm working on that now - unfortunately, the only slurm machine I can access doesn't have slurm's affinity module activated.
 

Brice

_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel