Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-02-17 08:59:41


On Feb 17, 2012, at 8:21 AM, Ralph Castain wrote:

>> I didn't follow this entire thread in detail, but I have a feeling that something is wrong here. The flag does fix your problem, but I think it may break binding too. It basically makes all "unavailable resources" available, so the binding code may end up trying to bind processes to cores that it can't actually use.

I'm not sure I follow here -- which binding code are you referring to; that in hwloc, or that in OMPI?

Ralph and I just talked about this issue on the phone. I think OMPI is currently determining "is this process bound" in an incorrect way.

My understanding of what we should be doing is to compare the output bitmask from hwloc_get_cpubind() with the allowed_cpuset on the HWLOC_OBJ_MACHINE object. If the set we are bound to is a strict subset of the allowed cpuset, then the process is bound.

Is that correct?
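
To make the proposed test concrete, here is a minimal sketch of the comparison. It is only a model: cpusets are plain 64-bit masks (one bit per PU), and cpumask_t, mask_is_included() and process_is_bound() are hypothetical names for illustration, not OMPI or hwloc symbols. In real code the two masks would be hwloc bitmaps (the result of hwloc_get_cpubind() and the machine's allowed cpuset), and the comparison would presumably be built from hwloc_bitmap_isincluded() and hwloc_bitmap_isequal().

```c
#include <stdbool.h>
#include <stdint.h>

/* Model only: one bit per PU. Real code would use hwloc bitmaps. */
typedef uint64_t cpumask_t;

/* True if every PU in 'sub' is also present in 'super'. */
static bool mask_is_included(cpumask_t sub, cpumask_t super)
{
    return (sub & ~super) == 0;
}

/*
 * Proposed rule: the process counts as bound when the set it is
 * bound to is a strict subset of the allowed set. An empty binding
 * mask is treated as "not bound" (an assumption of this sketch).
 */
static bool process_is_bound(cpumask_t binding, cpumask_t allowed)
{
    if (binding == 0) {
        return false;
    }
    return mask_is_included(binding, allowed) && binding != allowed;
}
```

With 8 allowed PUs (mask 0xFF), a binding of 0x0F counts as bound, while a binding of 0xFF (every allowed PU) does not.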

And per Ralph's question, the allowed_cpuset of HWLOC_OBJ_MACHINE will still be accurate even if we do WHOLE_SYSTEM, right? E.g., if some external agent creates a Linux cpuset for a process, then even if we specify WHOLE_SYSTEM, the allowed_cpuset on OBJ_MACHINE will still accurately reflect which PUs are in the Linux cpuset where this process is running.

Right?

>> If srun gives you the first cores of the machine, it works fine because OMPI tries to use the first cores and those are available. But did you ever try when srun gives you only the second socket, for instance? Or any part of the machine that does not contain the first cores? I think OMPI will still try to bind on the first cores if the flag is set, but those are not available for binding.

We'll have to check that; I hope that's not right. :-)

>> Unless I am missing something, the proper fix would be to have two instances of the topology: one with the entire machine (for people who really want to consult all physical resources), and one for the part of the machine that is actually available (mostly used for binding).

If allowed_cpuset is still accurate with WHOLE_SYSTEM, I hope this won't be necessary (i.e., that everywhere hwloc data is used in OMPI, we obey allowed_cpuset).
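
As a rough illustration of what "obeying allowed_cpuset" would mean for the srun case above, here is a sketch that picks a PU by counting only allowed PUs instead of counting from PU 0. Again this is a model with cpusets as 64-bit masks; cpumask_t and nth_allowed_pu() are hypothetical names, not existing OMPI or hwloc functions.

```c
#include <stdint.h>

/* Model only: one bit per PU. Real code would use hwloc bitmaps. */
typedef uint64_t cpumask_t;

/*
 * Return the index of the rank-th allowed PU (0-based), or -1 if
 * fewer than rank+1 PUs are allowed. PUs outside 'allowed' are
 * skipped, so we never pick a PU we cannot bind to.
 */
static int nth_allowed_pu(cpumask_t allowed, unsigned rank)
{
    for (int pu = 0; pu < 64; pu++) {
        if (allowed & ((cpumask_t)1 << pu)) {
            if (rank == 0) {
                return pu;
            }
            rank--;
        }
    }
    return -1;
}
```

If the allowed mask is 0xF0 (say srun handed us only PUs 4-7), local rank 0 lands on PU 4 rather than PU 0, and asking for a fifth PU returns -1 instead of silently picking one that is not available.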

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/