Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-17 08:21:11


On Thu, Feb 16, 2012 at 11:36 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:

> On 16/02/2012 14:16, nadia.derbey_at_[hidden] wrote:
>
> Hi Jeff,
>
> Sorry for the delay, but my victim with 2 ib devices had been stolen ;-)
>
> So, I ported the patch to the v1.5 branch and was finally able to test it.
>
> Actually, there is no opal_hwloc_base_get_topology() in v1.5, so I had to
> set the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the
> places where opal_hwloc_topology is initialized).
>
> With the new flag set,
> hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE)
> now sees the actual number of cores on the node (instead of 1 when our
> cpuset is a singleton).
>
> Since opal_paffinity_base_get_processor_info() calls
> module_get_processor_info() (in hwloc/paffinity_hwloc_module.c), which in
> turn calls hwloc_get_nbobjs_by_type(), we are now getting the right number
> of cores in get_ib_dev_distance().
>
> So we are looping over the exact number of cores, looking for a potential
> binding.
>
> In conclusion, no other patch is needed: the fix you committed is
> sufficient to resolve the issue.
>
>
> I didn't follow this entire thread in detail, but I have the feeling that
> something is wrong here. The flag does fix your problem, but I think it
> may break binding too. It basically makes all "unavailable resources"
> available, so the binding code may end up trying to bind processes to cores
> that it can't actually use.
>
> If srun gives you the first cores of the machine, it works fine because
> OMPI tries to use the first cores and those are available. But have you
> ever tried a case where srun gives you only the second socket, for
> instance, or any part of the machine that does not contain the first cores?
> I think OMPI will still try to bind to the first cores if the flag is set,
> but those are not available for binding.
>
> Unless I am missing something, the proper fix would be to have two
> instances of the topology: one covering the entire machine (for people who
> really want to consult all physical resources), and one restricted to the
> part of the machine that is actually available (mostly used for binding).
>

Hmmm... are you saying that the "allowed" cpuset is no longer accurate when
this flag is set? That would seem strange. If so, can we ask that hwloc
instead show all the resources, but correctly reflect the allowed cpuset? In
other words, give us a flag so that the hwloc topology includes resources
that have zero bits in the allowed cpuset?
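
To make the distinction concrete, here is a rough sketch of what I have in
mind. It is only a toy model, not hwloc or OMPI code: node_view,
count_visible_cores() and pick_bind_core() are hypothetical names, and a plain
64-bit mask stands in for a real hwloc bitmap. The idea is that the topology
can report every core, while binding only ever picks from the allowed set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one node: one bit per core, at most 64 cores. */
typedef struct {
    int      ncores;   /* cores physically present on the node */
    uint64_t allowed;  /* bit i set => core i may be used by this job */
} node_view;

/* Cores reported by the topology. With include_disallowed == 0 only the
 * allowed cores are counted (1 for a singleton cpuset); with
 * include_disallowed != 0 every physical core is counted. */
int count_visible_cores(const node_view *v, int include_disallowed)
{
    assert(v->ncores > 0 && v->ncores <= 64);
    if (include_disallowed)
        return v->ncores;

    int n = 0;
    for (int i = 0; i < v->ncores; i++)
        if (v->allowed & (UINT64_C(1) << i))
            n++;
    return n;
}

/* Binding target for the rank-th local process: the rank-th core that is
 * actually allowed, never simply "core number rank". Returns -1 when there
 * are not enough allowed cores. */
int pick_bind_core(const node_view *v, int rank)
{
    assert(v->ncores > 0 && v->ncores <= 64);
    if (rank < 0)
        return -1;

    for (int i = 0; i < v->ncores; i++) {
        if (!(v->allowed & (UINT64_C(1) << i)))
            continue;          /* physically present, but not ours to use */
        if (rank-- == 0)
            return i;
    }
    return -1;
}
```

With 8 cores and srun handing us only the second socket (mask 0xF0), the
whole-machine count is 8, the restricted count is 4, and local rank 0 should
land on core 4, not core 0.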

> Brice
>
>