Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: nadia.derbey_at_[hidden]
Date: 2012-02-17 06:18:26


devel-bounces_at_[hidden] wrote on 02/17/2012 08:36:54 AM:

> From: Brice Goglin <Brice.Goglin_at_[hidden]>
> To: devel_at_[hidden]
> Date: 02/17/2012 08:37 AM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-bounces_at_[hidden]
>
> On 16/02/2012 14:16, nadia.derbey_at_[hidden] wrote:
> Hi Jeff,
>
> Sorry for the delay, but my victim with 2 IB devices had been stolen ;-)

>
> So, I ported the patch to the v1.5 branch and could finally test it.
>
> Actually, there is no opal_hwloc_base_get_topology() in v1.5, so I had
> to set the hwloc flags in ompi_mpi_init() and orte_odls_base_open()
> (i.e. the places where opal_hwloc_topology is initialized).
>
> With the new flag set, hwloc_get_nbobjs_by_type(opal_hwloc_topology,
> HWLOC_OBJ_CORE) is now seeing the actual number of cores on the node
> (instead of 1 when our cpuset is a singleton).
>
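
For the record, a minimal standalone sketch (assuming hwloc 1.x, outside
the OMPI tree) of what setting that flag before loading the topology
looks like:

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;

        hwloc_topology_init(&topology);
        /* see the whole machine, not just our current cpuset */
        hwloc_topology_set_flags(topology,
                                 HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
        hwloc_topology_load(topology);

        /* with the flag set, this reports the actual number of cores
         * on the node, even when our cpuset is a singleton */
        printf("%d cores\n",
               hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE));

        hwloc_topology_destroy(topology);
        return 0;
    }
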
> Since opal_paffinity_base_get_processor_info() calls
> module_get_processor_info()
> (in hwloc/paffinity_hwloc_module.c), which in turn calls
> hwloc_get_nbobjs_by_type(),
> we are now getting the right number of cores in get_ib_dev_distance().
>
> So we are looping over the exact number of cores, looking for a
> potential binding.
>
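
In hwloc terms, that loop amounts to something like the following
hypothetical helper (not the actual v1.5 code; find_bound_core is a
made-up name):

    #include <hwloc.h>

    /* returns the logical index of the core the process is bound
     * within, or -1 if it is not bound inside a single core */
    static int find_bound_core(hwloc_topology_t topology)
    {
        hwloc_bitmap_t bound = hwloc_bitmap_alloc();
        int i, found = -1;
        int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);

        if (hwloc_get_cpubind(topology, bound,
                              HWLOC_CPUBIND_PROCESS) == 0) {
            for (i = 0; i < ncores; i++) {
                hwloc_obj_t core =
                    hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, i);
                if (hwloc_bitmap_isincluded(bound, core->cpuset)) {
                    found = i;  /* bound within this core */
                    break;
                }
            }
        }
        hwloc_bitmap_free(bound);
        return found;
    }
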
> So, as a conclusion, there's no need for any other patch: the fix you
> committed was the only one needed to fix the issue.
>
> I didn't follow this entire thread in detail, but I have a feeling
> that something is wrong here. The flag fixes your problem indeed, but
> I think it may break binding too. It's basically making all
> "unavailable resources" available. So the binding code may end up
> trying to bind processes on cores that it can't actually use.

It's true that if we have a resource manager that can allocate, say, a
single socket within a node for us, the binding part of OMPI might go
beyond its actual boundaries.

>
> If srun gives you the first cores of the machine, it works fine
> because OMPI tries to use the first cores and those are available.
> But did you ever try when srun gives the second socket only for
> instance? Or whichever part of the machine that does not contain the
> first cores ?

But I have to look for the proper option in slurm first: I don't know
whether slurm allows such fine-grained allocation, i.e. an option that
makes it possible to allocate socket X (X != 0).

> I think OMPI will still try to bind on the first cores
> if the flag is set, but those are not available for binding.
>
> Unless I am missing something, the proper fix would be to have two
> instances of the topology. One with the entire machine (for people
> that really want to consult all physical resources), and one for the
> really available part of machine (mostly used for binding).

Agreed!
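
For illustration, a rough sketch (again assuming hwloc 1.x; the names
are made up) of what the two instances could look like:

    #include <hwloc.h>

    /* one topology for the whole machine, one restricted to what we
     * are actually allowed to use */
    static void load_topologies(hwloc_topology_t *whole,
                                hwloc_topology_t *avail)
    {
        hwloc_topology_init(whole);
        /* entire machine: for code that wants to consult all
         * physical resources */
        hwloc_topology_set_flags(*whole,
                                 HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
        hwloc_topology_load(*whole);

        hwloc_topology_init(avail);
        /* default behavior: restricted to our cpuset, safe for
         * binding decisions */
        hwloc_topology_load(avail);
    }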

Regards,
Nadia
>
> Brice
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel