I'm wondering what value there is in showing the full topology, or using it in any of our components, if the process is restricted to a specific set of cpus? Does it really help to know that there are other cpus out there that are unreachable?
firstname.lastname@example.org wrote on 02/09/2012 12:20:41
> From: Brice Goglin <Brice.Goglin@inria.fr>
> To: Open MPI Developers <email@example.com>
> Date: 02/09/2012 12:20 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance
> processes as bound if the job has been launched by srun
> Sent by: firstname.lastname@example.org
> By default, hwloc only shows what's inside the current cpuset. There's
> an option to show everything instead (topology flag).
So maybe using that flag inside opal_paffinity_base_get_processor_info()
would be a better fix than the one I'm proposing in my patch.
I found a bunch of other places where things are managed
as in get_ib_dev_distance().
Just doing a grep in the sources, I could find:
. init_maffinity() in btl/sm/btl_sm.c
. vader_init_maffinity() in btl/vader/btl_vader.c
. get_ib_dev_distance() in btl/wv/btl_wv_component.c
So I think the flag Brice is talking about should
definitely be the fix.
> On 09/02/2012 12:18, Jeff Squyres wrote:
> > Just so that I understand this better -- if a process is bound to
> > a cpuset, will tools like hwloc's lstopo only show the Linux
> > processors *in that cpuset*? I.e., does it not have any visibility
> > of the processors outside of its cpuset?
> > On Jan 27, 2012, at 11:38 AM, nadia.derbey wrote:
> >> Hi,
> >> If a job is launched using "srun --resv-ports --cpu_bind:..."
> >> is configured with:
> >> TaskPlugin=task/affinity
> >> TaskPluginParam=Cpusets
> >> each rank of that job is in a cpuset that contains a single
> >> Now, if we use carto on top of this, the following happens
> >> get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
> >> . opal_paffinity_base_get_processor_info() is called to get the
> >> number of logical processors (we get 1 due to the singleton cpuset)
> >> . we loop over that # of processors to check whether our process is
> >> bound to one of them. In our case the loop will be executed only
> >> once and we will never get the correct binding
> >> . if the process is bound actually get the distance to the device.
> >> in our case we won't execute that part of the
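
The loop problem described in the steps above can be modelled with a small C sketch. This is a simplified illustration, not the actual Open MPI code: `cpu_mask_t`, `visible_count()`, `find_bound_naive()` and `find_bound_full()` are hypothetical names, and a plain 64-bit mask stands in for the real paffinity data structures. It assumes the process is bound to a single processor whose id is larger than the number of processors visible in the cpuset.

```c
#include <stdint.h>

/* Hypothetical stand-in for a binding mask: bit i set means the process
 * may run on logical processor i. Not Open MPI's real paffinity types. */
typedef uint64_t cpu_mask_t;

/* Number of processors visible in the mask, i.e. what a query restricted
 * to the current cpuset would report as the processor count. */
static int visible_count(cpu_mask_t mask)
{
    int n = 0;
    for (; mask != 0; mask &= mask - 1) {
        n++;                      /* clear the lowest set bit each pass */
    }
    return n;
}

/* Pattern described in the thread: loop over [0, visible_count) and test
 * each index. Returns the bound processor id, or -1 if none was found. */
static int find_bound_naive(cpu_mask_t mask)
{
    int n = visible_count(mask);
    for (int i = 0; i < n; i++) {
        if (mask & ((cpu_mask_t)1 << i)) {
            return i;
        }
    }
    return -1;
}

/* Alternative: loop over every processor id in the whole machine
 * (total_procs is an assumed input, capped at the 64 bits of the mask). */
static int find_bound_full(cpu_mask_t mask, int total_procs)
{
    for (int i = 0; i < total_procs && i < 64; i++) {
        if (mask & ((cpu_mask_t)1 << i)) {
            return i;
        }
    }
    return -1;
}
```

With a singleton mask for processor 5 on an assumed 8-processor machine, `find_bound_naive()` only tests index 0 and reports no binding, while `find_bound_full()` finds processor 5. The sketch only shows why the loop bound matters; it says nothing about which of the fixes discussed in this thread is preferable.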
> >> The attached patch is a proposal to fix the issue.
> >> Regards,
> >> Nadia
> >> <get_ib_dev_distance.patch>
> >> _______________________________________________
> >> devel mailing list
> >> email@example.com
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel