On 16/02/2012 14:16, nadia.derbey@bull.net wrote:
Hi Jeff,

Sorry for the delay, but my victim with 2 ib devices had been stolen ;-)

So, I ported the patch to the v1.5 branch and could finally test it.

Actually, there is no opal_hwloc_base_get_topology() in v1.5 so I had to set
the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the places
where opal_hwloc_topology is initialized).

With the new flag set, hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE)
now returns the actual number of cores on the node (instead of 1 when our
cpuset is a singleton).

Since opal_paffinity_base_get_processor_info() calls module_get_processor_info()
(in hwloc/paffinity_hwloc_module.c), which in turn calls hwloc_get_nbobjs_by_type(),
we are now getting the right number of cores in get_ib_dev_distance().

So we are looping over the exact number of cores, looking for a potential binding.

In conclusion, there's no need for any other patch: the fix you committed
was the only one needed to fix the issue.

I didn't follow this entire thread in detail, but I have a feeling that something is wrong here. The flag does fix your problem, but I think it may break binding too. It basically makes all "unavailable resources" available, so the binding code may end up trying to bind processes to cores that it can't actually use.

If srun gives you the first cores of the machine, it works fine because OMPI tries to use the first cores and those are available. But did you ever try it when srun gives you only the second socket, for instance, or any part of the machine that does not contain the first cores? I think OMPI will still try to bind to the first cores if the flag is set, but those are not available for binding.

Unless I am missing something, the proper fix would be to have two instances of the topology: one with the entire machine (for people who really want to consult all physical resources), and one restricted to the part of the machine that is actually available (mostly used for binding).
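
To make the concern concrete, here is a rough sketch in plain C. It is only a model and does not use hwloc or any OMPI code: the node is represented by two bitmasks (one bit per core), and every name in it (coreset_t, node_view, pick_from_whole, pick_from_allowed) is hypothetical. It only illustrates why counting cores can use the whole-machine view while choosing a core to bind to should use the available view.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one bit per core, at most 64 cores per node. */
typedef uint64_t coreset_t;

/* The two views suggested above, kept side by side. */
struct node_view {
    coreset_t whole;    /* every physical core on the node */
    coreset_t allowed;  /* cores the resource manager actually gave us */
};

/* Number of cores in a set (clears the lowest set bit each iteration). */
static int count_cores(coreset_t set)
{
    int n = 0;
    while (set) {
        set &= set - 1;
        n++;
    }
    return n;
}

/* Index of the n-th (0-based) core present in the set, or -1 if none. */
static int nth_core(coreset_t set, int n)
{
    assert(n >= 0);
    for (int i = 0; i < 64; i++) {
        if (set & ((coreset_t)1 << i)) {
            if (n == 0)
                return i;
            n--;
        }
    }
    return -1;
}

/* True if the given core index is in the allowed view. */
static int is_allowed(const struct node_view *v, int core)
{
    return core >= 0 && core < 64 && ((v->allowed >> core) & 1);
}

/* Problematic choice: walks the whole machine, so it may return a core
 * that is not available for binding. */
static int pick_from_whole(const struct node_view *v, int local_rank)
{
    int n = count_cores(v->whole);
    return n ? nth_core(v->whole, local_rank % n) : -1;
}

/* Safer choice: only ever returns a core from the allowed view. */
static int pick_from_allowed(const struct node_view *v, int local_rank)
{
    int n = count_cores(v->allowed);
    return n ? nth_core(v->allowed, local_rank % n) : -1;
}
```

For a node with 8 cores where the job only got cores 4-7 (whole = 0xFF, allowed = 0xF0), count_cores() on the whole view gives 8, pick_from_whole(&v, 0) gives core 0, which is not allowed, and pick_from_allowed(&v, 0) gives core 4.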