
Open MPI Development Mailing List Archives


Subject: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: nadia.derbey (Nadia.Derbey_at_[hidden])
Date: 2012-01-27 11:38:34


Hi,

If a job is launched using "srun --resv-ports --cpu_bind:..." and Slurm
is configured with:
   TaskPlugin=task/affinity
   TaskPluginParam=Cpusets

then each rank of that job runs in a cpuset that contains a single CPU.

Now, if we use carto on top of this, the following happens in
get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
   . opal_paffinity_base_get_processor_info() is called to get the
     number of logical processors (we get 1 because of the singleton
     cpuset).
   . We loop over that number of processors to check whether our
     process is bound to one of them. In our case the loop is executed
     only once, so we never get the correct binding information.
   . If the process is bound, we then get the distance to the device.
     In our case that part of the code is never executed.
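
To make the failure mode concrete, here is a minimal, self-contained sketch.
It is neither the Open MPI code nor the attached patch: cpu_mask_t,
MAX_CPU_IDS and all function names below are hypothetical, and the affinity
mask is modelled as a plain 64-bit integer. It only illustrates why using
the number of visible processors as the upper bound of a loop over CPU ids
can miss the CPU a rank is bound to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an affinity mask: bit i set means the
 * process may run on CPU id i. This is NOT the OPAL paffinity type. */
typedef uint64_t cpu_mask_t;
#define MAX_CPU_IDS 64

/* Build a mask that allows exactly one CPU, like a singleton cpuset. */
cpu_mask_t single_cpu_mask(int cpu)
{
    assert(cpu >= 0 && cpu < MAX_CPU_IDS);
    return (cpu_mask_t)1 << cpu;
}

/* Number of CPUs visible through the mask. For a singleton cpuset
 * this is 1, whatever the id of that CPU is. */
int visible_cpu_count(cpu_mask_t mask)
{
    int n = 0;
    for (int i = 0; i < MAX_CPU_IDS; i++) {
        if (mask & ((cpu_mask_t)1 << i)) {
            n++;
        }
    }
    return n;
}

/* Problematic pattern: the visible count is used as the upper bound
 * on CPU ids. Returns the bound CPU id, or -1 if none was found. */
int find_bound_cpu_by_count(cpu_mask_t mask)
{
    int n = visible_cpu_count(mask);
    for (int i = 0; i < n; i++) {
        if (mask & ((cpu_mask_t)1 << i)) {
            return i;
        }
    }
    return -1;
}

/* One possible alternative (an assumption, not the proposed patch):
 * scan every possible CPU id instead of only the first n. */
int find_bound_cpu_full_scan(cpu_mask_t mask)
{
    for (int i = 0; i < MAX_CPU_IDS; i++) {
        if (mask & ((cpu_mask_t)1 << i)) {
            return i;
        }
    }
    return -1;
}
```

In this model, a rank confined to CPU 5 has a visible count of 1, so
find_bound_cpu_by_count() only tests id 0 and reports the rank as unbound,
while find_bound_cpu_full_scan() finds CPU 5. The count-based version
succeeds only when the single allowed CPU happens to be CPU 0.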

The attached patch is a proposal to fix the issue.

Regards,
Nadia