On 16/02/2012 14:16, firstname.lastname@example.org wrote:
Sorry for the delay, but my victim with 2 ib
had been stolen ;-)
So, I ported the patch on the v1.5 branch and
could test it.
Actually, there is no
in v1.5 so I had to set
the hwloc flags in ompi_mpi_init() and
(i.e. the places
where opal_hwloc_topology is initialized).
With the new flag set,
is now seeing the actual number of cores on the
(instead of 1 when our
cpuset is a singleton).
(in hwloc/paffinity_hwloc_module.c), which in
we are now getting the right number of cores in
So we are looping over the exact number of
looking for a potential binding.
So, in conclusion, there's no need for any
patch: the fix you committed
was the only one needed to fix the issue.
I didn't follow this entire thread in detail, but I have the feeling
that something is wrong here. The flag does fix your problem, but I
think it may break binding too. It basically makes all "unavailable
resources" available, so the binding code may end up trying to bind
processes to cores that it can't actually use.
If srun gives you the first cores of the machine, it works fine
because OMPI tries to use the first cores and those are available.
But did you ever try when srun gives you only the second socket, for
instance? Or any part of the machine that does not contain the first
cores? I think OMPI will still try to bind to the first cores if the
flag is set, but those are not available for binding.
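
To make the concern concrete, here is a toy model in plain C. It does
not use hwloc or OMPI at all: the bitmask type, the function names and
the "rank r goes to core r" policy are all made up for illustration
and only approximate what I suspect the binding code does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: bit i is set when core i belongs to the set. */
typedef uint64_t core_mask_t;

/* Naive policy (an assumption, not the real OMPI code): rank r is
 * bound to core r of the whole machine, ignoring the allowed set. */
int naive_core_for_rank(int rank, int total_cores)
{
    assert(total_cores > 0 && total_cores <= 64);
    return rank % total_cores;
}

/* Returns 1 if the core is inside the allowed set, 0 otherwise. */
int core_is_allowed(core_mask_t allowed, int core)
{
    assert(core >= 0 && core < 64);
    return (int)((allowed >> core) & 1u);
}

/* Counts how many of the first nranks ranks the naive policy would
 * send to a core that is outside the allowed set. */
int count_bad_bindings(core_mask_t allowed, int total_cores, int nranks)
{
    int bad = 0;
    for (int r = 0; r < nranks; r++) {
        if (!core_is_allowed(allowed, naive_core_for_rank(r, total_cores)))
            bad++;
    }
    return bad;
}
```

On an 8-core machine, count_bad_bindings(0x0F, 8, 4) is 0 because the
job owns cores 0-3, while count_bad_bindings(0xF0, 8, 4) is 4 because
the job owns cores 4-7 and every naive choice falls outside them.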
Unless I am missing something, the proper fix would be to have two
instances of the topology: one with the entire machine (for people
who really want to consult all physical resources), and one with only
the part of the machine that is actually available (mostly used for
binding).
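
A rough sketch of the two-instance idea, again as a toy model in plain
C rather than real hwloc or OMPI code. The struct, its field names and
the round-robin policy are hypothetical. The point is only that core
counts are answered from the whole view while binding targets are
taken from the available view.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: bit i is set when core i belongs to the set. */
typedef uint64_t core_mask_t;

/* Two views of the same machine (names are illustrative only). */
struct topo_views {
    core_mask_t whole;     /* every physical core on the node */
    core_mask_t available; /* cores this job may actually bind to */
};

/* Number of cores in a set. */
int count_cores(core_mask_t m)
{
    int n = 0;
    while (m) {
        m &= m - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the n-th (0-based) core in the set, or -1 if none. */
int nth_core(core_mask_t m, int n)
{
    assert(n >= 0);
    for (int core = 0; core < 64; core++) {
        if ((m >> core) & 1u) {
            if (n == 0)
                return core;
            n--;
        }
    }
    return -1;
}

/* Round-robin choice restricted to the given set, or -1 if empty. */
int pick_core(core_mask_t available, int rank)
{
    int navail = count_cores(available);
    assert(rank >= 0);
    if (navail == 0)
        return -1;
    return nth_core(available, rank % navail);
}

/* Queries about the physical machine use the whole view... */
int reported_core_count(const struct topo_views *v)
{
    return count_cores(v->whole);
}

/* ...while binding decisions use only the available view. */
int binding_core_for_rank(const struct topo_views *v, int rank)
{
    return pick_core(v->available, rank);
}
```

With whole = 0xFF and available = 0xF0, the reported core count is 8,
but rank 0 is bound to core 4 and rank 3 to core 7, so nothing is ever
placed on the first socket.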