
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-09 18:06:45


That's pretty much what I had in mind too - will have to play with it a bit until we find the best solution, but it shouldn't be all that hard.

On Feb 9, 2012, at 2:23 PM, Brice Goglin wrote:

> Here's what I would do:
> During init, walk the list of hwloc PCI devices (hwloc_get_next_pcidev()) and keep an array of pointers to the interesting ones + their locality (the hwloc cpuset of the parent non-IO object).
> When you want the I/O device near a core, walk the array and find one whose locality contains your core's hwloc cpuset.
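
A minimal sketch of that init-time scan, assuming hwloc 1.x with HWLOC_TOPOLOGY_FLAG_IO_DEVICES set when the topology is loaded; the pci_entry table and function names below are illustrative, not existing OMPI code:

/* Illustrative sketch only, not existing OMPI code.  Assumes the topology
 * was loaded with HWLOC_TOPOLOGY_FLAG_IO_DEVICES (hwloc 1.x). */
#include <hwloc.h>

struct pci_entry {
    hwloc_obj_t dev;          /* the hwloc PCI device object */
    hwloc_cpuset_t locality;  /* cpuset of its non-I/O ancestor */
};

/* Build the table once during init. */
static int build_pci_table(hwloc_topology_t topo,
                           struct pci_entry *table, int max)
{
    int n = 0;
    hwloc_obj_t dev = NULL;
    while (n < max && (dev = hwloc_get_next_pcidev(topo, dev)) != NULL) {
        hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, dev);
        if (NULL == anc || NULL == anc->cpuset)
            continue;
        table[n].dev = dev;
        table[n].locality = hwloc_bitmap_dup(anc->cpuset);
        n++;
    }
    return n;
}

/* Later, per process: pick a device whose locality contains the core's cpuset. */
static hwloc_obj_t find_pcidev_near(struct pci_entry *table, int n,
                                    hwloc_const_cpuset_t core_cpuset)
{
    int i;
    for (i = 0; i < n; i++)
        if (hwloc_bitmap_isincluded(core_cpuset, table[i].locality))
            return table[i].dev;
    return NULL;
}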
>
> If you need help, feel free to contact me offline.
>
> Brice
>
>
>
> On 09/02/2012 22:14, Ralph Castain wrote:
>>
>> Hmmm... guess we'll have to play with it. Our need is to start with a core or some similar object, and quickly determine the closest I/O device of a certain type. We wound up having to write "summarizer" code to parse the hwloc tree into a more OMPI-usable form, so we can always do that with the I/O tree as well if necessary.
>>
>>
>> On Feb 9, 2012, at 2:09 PM, Brice Goglin wrote:
>>
>>> That doesn't really work with the hwloc model unfortunately. Also, when you get to smaller objects (cores, threads, ...) there are multiple "closest" objects at each depth.
>>>
>>> We have one "closest" object at some depth (usually Machine or NUMA node). If you need something higher, you just walk the parent links. If you need something smaller, you look at children.
>>>
>>> Also, each I/O device isn't directly attached to such a closest object. It's usually attached under some bridge objects. There's a tree of hwloc PCI bus objects exactly like you have a tree of hwloc sockets/cores/threads/etc. At the top of the I/O tree, one (bridge) object is attached to a regular object as explained earlier. So, when you have a random hwloc PCI object, you get its locality by walking up its parent link until you find a non-I/O object (one whose cpuset isn't NULL). hwloc/helper.h gives you hwloc_get_non_io_ancestor_obj() to do that.
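
A small sketch of that parent walk (illustrative, not existing OMPI code), assuming a topology already loaded with I/O discovery enabled:

/* Illustrative sketch only, not existing OMPI code. */
#include <hwloc.h>

/* Locality of an arbitrary hwloc PCI object: climb the parent links,
 * through the PCI bridge objects, until an object with a non-NULL
 * cpuset is reached.  hwloc/helper.h wraps that walk:
 *   hwloc_obj_t anc = pcidev;
 *   while (anc && NULL == anc->cpuset) anc = anc->parent;
 */
static hwloc_const_cpuset_t pci_locality(hwloc_topology_t topo,
                                         hwloc_obj_t pcidev)
{
    hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, pcidev);
    return (NULL != anc) ? anc->cpuset : NULL;
}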
>>>
>>> Brice
>>>
>>>
>>>
>>> On 09/02/2012 14:34, Ralph Castain wrote:
>>>>
>>>> Ah, okay - in that case, having the I/O device attached to the "closest" object at each depth would be ideal from an OMPI perspective.
>>>>
>>>> On Feb 9, 2012, at 6:30 AM, Brice Goglin wrote:
>>>>
>>>>> The BIOS usually tells you which NUMA location is close to each host-to-PCI bridge. So the answer is yes.
>>>>> Brice
>>>>>
>>>>>
>>>>> Ralph Castain <rhc_at_[hidden]> wrote:
>>>>> I'm not sure I understand this comment. A PCI device is attached to the node, not to any specific location within the node, isn't it? Can you really say that a PCI device is "attached" to a specific NUMA location, for example?
>>>>>
>>>>>
>>>>> On Feb 9, 2012, at 6:15 AM, Jeff Squyres wrote:
>>>>>
>>>>>> That doesn't seem too attractive from an OMPI perspective, though. We'd want to know where the PCI devices are actually rooted.
>>>>>