Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-02-17 02:36:54


On 16/02/2012 14:16, nadia.derbey_at_[hidden] wrote:
> Hi Jeff,
>
> Sorry for the delay, but my victim with 2 ib devices had been stolen ;-)
>
> So, I ported the patch on the v1.5 branch and finally could test it.
>
> Actually, there is no opal_hwloc_base_get_topology() in v1.5 so I had to set
> the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the places
> where opal_hwloc_topology is initialized).
>
> With the new flag set,
> hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE)
> is now seeing the actual number of cores on the node (instead of 1 when our
> cpuset is a singleton).
>
> Since opal_paffinity_base_get_processor_info() calls
> module_get_processor_info() (in hwloc/paffinity_hwloc_module.c), which in
> turn calls hwloc_get_nbobjs_by_type(), we are now getting the right number
> of cores in get_ib_dev_distance().
>
> So we are looping over the exact number of cores, looking for a potential
> binding.
>
> So as a conclusion, there's no need for any other patch: the fix you
> committed was the only one needed to fix the issue.

I didn't follow this entire thread in detail, but I have the feeling that
something is wrong here. The flag does fix your problem, but I think
it may break binding too. It basically makes all "unavailable
resources" look available. So the binding code may end up trying to bind
processes to cores that it can't actually use.

If srun gives you the first cores of the machine, it works fine because
OMPI tries to use the first cores and those are available. But did you
ever try the case where srun gives you only the second socket, for
instance? Or any part of the machine that does not contain the first
cores? I think OMPI will still try to bind to the first cores if the flag
is set, but those are not available for binding.
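
To make the scenario concrete, here is a toy model in plain C. It does not
use hwloc at all: the bitmask type and both helper names are made up for
illustration, with one bit per core. It only shows why picking the "first
core" from a whole-machine view can yield a core outside the allowed cpuset.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (not hwloc): bit i set means core i is present in the view. */
typedef uint64_t coreset_t;

/* Index of the lowest core in the set, or -1 if the set is empty. */
static int first_core(coreset_t set)
{
    for (int i = 0; i < 64; i++)
        if (set & ((coreset_t)1 << i))
            return i;
    return -1;
}

/* Returns 1 if `core` is in the allowed cpuset, i.e. a bind could succeed. */
static int can_bind(coreset_t allowed, int core)
{
    assert(core >= 0 && core < 64);
    return (int)((allowed >> core) & 1);
}
```

Assuming a 16-core node with two 8-core sockets where srun grants only the
second socket, the whole view is 0xFFFF and the allowed set is 0xFF00.
first_core(0xFFFF) is core 0, and can_bind(0xFF00, 0) is 0, so the bind
would fail. first_core(0xFF00) is core 8, which can be bound.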

Unless I am missing something, the proper fix would be to have two
instances of the topology: one with the entire machine (for people who
really want to consult all physical resources), and one with only the
part of the machine that is actually available (mostly used for binding).
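
A rough sketch of the two-instance idea, again as a plain C toy model rather
than real hwloc or OMPI code. The struct and function names are
hypothetical; in practice each view would presumably be a separately loaded
topology. Physical queries read the whole view, while binding only ever
picks from the available view.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of views; bit i set means core i belongs to the view. */
struct topo_views {
    uint64_t whole;     /* every physical core on the node */
    uint64_t available; /* cores in our cpuset (e.g. what srun granted) */
};

/* Count the cores (set bits) in a view. */
static int count_cores(uint64_t view)
{
    int n = 0;
    for (; view != 0; view &= view - 1)
        n++;
    return n;
}

/* Physical queries consult the whole-machine view. */
static int physical_core_count(const struct topo_views *t)
{
    return count_cores(t->whole);
}

/* Binding picks the (rank mod n)-th available core; -1 if none is available. */
static int pick_binding_core(const struct topo_views *t, int rank)
{
    int n = count_cores(t->available);
    assert(rank >= 0);
    if (n == 0)
        return -1;
    int skip = rank % n;
    for (int i = 0; i < 64; i++) {
        if ((t->available >> i) & 1) {
            if (skip == 0)
                return i;
            skip--;
        }
    }
    return -1;
}
```

With whole = 0xFFFF and available = 0xFF00, code such as
get_ib_dev_distance() could loop over all 16 physical cores, while the
binding code would only choose among cores 8 to 15.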

Brice