On 16/02/2012 14:16, nadia.derbey@bull.net wrote:
Hi Jeff,
Sorry for the delay, but my victim with 2 IB devices had been stolen ;-)
So, I ported the patch to the v1.5 branch and could finally test it.
Actually, there is no opal_hwloc_base_get_topology() in v1.5, so I had
to set the hwloc flags in ompi_mpi_init() and orte_odls_base_open()
(i.e. the places where opal_hwloc_topology is initialized).
With the new flag set, hwloc_get_nbobjs_by_type(opal_hwloc_topology,
HWLOC_OBJ_CORE) now sees the actual number of cores on the node
(instead of 1 when our cpuset is a singleton).
Since opal_paffinity_base_get_processor_info() calls
module_get_processor_info() (in hwloc/paffinity_hwloc_module.c), which
in turn calls hwloc_get_nbobjs_by_type(), we now get the right number
of cores in get_ib_dev_distance(). So we loop over the exact number of
cores, looking for a potential binding.
So, in conclusion, there's no need for any other patch: the fix you
committed was the only one needed to fix the issue.

I didn't follow this entire thread in detail, but I have the feeling
that something is wrong here. The flag does fix your problem, but I
think it may break binding too. It basically makes all "unavailable
resources" available, so the binding code may end up trying to bind
processes to cores that it can't actually use.
If srun gives you the first cores of the machine, it works fine
because OMPI tries to use the first cores and those are available.
But did you ever try when srun gives you only the second socket, for
instance? Or any part of the machine that does not contain the first
cores? I think OMPI will still try to bind to the first cores if the
flag is set, but those are not available for binding.
Unless I am missing something, the proper fix would be to have two
instances of the topology: one with the entire machine (for people
that really want to consult all physical resources), and one with only
the part of the machine that is actually available (mostly used for
binding).
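
To make the concern concrete, here is a small toy model in plain C. It
does not use hwloc or any OMPI code: struct topo_view, make_view() and
pick_core() are hypothetical names invented for this sketch, and the
8-core, 2-socket node is an assumption. It only illustrates why picking
cores from a whole-machine view can land on a core that the scheduler
did not grant, while a second view restricted to the allowed cores
cannot.

```c
#include <assert.h>

#define NCORES 8    /* assumed node: 2 sockets x 4 cores */

/* Hypothetical stand-in for one topology instance. */
struct topo_view {
    int ncores;             /* number of cores visible in this view */
    int core_id[NCORES];    /* physical index of each visible core */
};

/*
 * Build a view of the node. With whole_machine != 0 every core is
 * listed (what the new flag gives us); otherwise only the cores the
 * scheduler granted are listed.
 */
struct topo_view make_view(const int allowed[NCORES], int whole_machine)
{
    struct topo_view v = { 0, { 0 } };
    for (int i = 0; i < NCORES; i++) {
        if (whole_machine || allowed[i])
            v.core_id[v.ncores++] = i;
    }
    return v;
}

/*
 * Pick a core for a local rank by walking the view in order, the way a
 * "use the first cores" policy would. Returns the physical core index,
 * or -1 if the view is empty or the chosen core is not allowed (i.e.
 * the bind would fail).
 */
int pick_core(const struct topo_view *v, const int allowed[NCORES], int rank)
{
    assert(rank >= 0);
    if (v->ncores == 0)
        return -1;
    int core = v->core_id[rank % v->ncores];
    return allowed[core] ? core : -1;
}
```

With allowed = { 0, 0, 0, 0, 1, 1, 1, 1 } (only the second socket
granted), the whole-machine view reports 8 cores and pick_core() returns
-1 for rank 0, because core 0 is not usable. The restricted view reports
4 cores and returns core 4. Keeping both views around gives the right
core count for things like distance computations and a safe list for
binding.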