
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: nadia.derbey_at_[hidden]
Date: 2012-02-06 10:28:17


Resending, as I didn't get any answer...

Regards,
Nadia
 

-- 
Nadia Derbey
 
devel-bounces_at_[hidden] wrote on 01/27/2012 05:38:34 PM:
> From: "nadia.derbey" <Nadia.Derbey_at_[hidden]>
> To: Open MPI Developers <devel_at_[hidden]>
> Date: 01/27/2012 05:35 PM
> Subject: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-bounces_at_[hidden]
> 
> Hi,
> 
> If a job is launched using "srun --resv-ports --cpu_bind:..." and Slurm
> is configured with:
>    TaskPlugin=task/affinity
>    TaskPluginParam=Cpusets
> 
> each rank of that job is in a cpuset that contains a single CPU.
> 
> Now, if we use carto on top of this, the following happens in
> get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
>    . opal_paffinity_base_get_processor_info() is called to get the
>      number of logical processors (we get 1 due to the singleton cpuset).
>    . We loop over that number of processors to check whether our
>      process is bound to one of them. In our case the loop is executed
>      only once, so we never get the correct binding information.
>    . If the process is bound, we actually get the distance to the
>      device. In our case that part of the code is never executed.
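
The steps above can be illustrated with a minimal, self-contained sketch. It is not the Open MPI code and not the attached patch: `cpu_mask_t`, `cpu_in_mask()`, `visible_cpu_count()`, `bound_cpu_by_count()` and `bound_cpu_full_scan()` are hypothetical names used only for illustration. It also assumes that the binding mask is indexed by the system-wide processor ID while the processor count reflects only the CPUs inside the cpuset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an affinity mask: bit i set means the
 * process may run on processor i. This is not an Open MPI type. */
typedef uint64_t cpu_mask_t;
#define MAX_CPUS 64

/* True if processor i is set in the mask. */
static bool cpu_in_mask(cpu_mask_t mask, int i)
{
    assert(i >= 0 && i < MAX_CPUS);
    return (mask & ((cpu_mask_t)1 << i)) != 0;
}

/* Number of processors visible to the process. Inside a cpuset that
 * contains a single CPU this is 1, whatever the ID of that CPU. */
static int visible_cpu_count(cpu_mask_t mask)
{
    int n = 0;
    for (int i = 0; i < MAX_CPUS; i++) {
        if (cpu_in_mask(mask, i)) {
            n++;
        }
    }
    return n;
}

/* Pattern described in the message: only indices [0, visible count)
 * are tested. Returns the bound processor, or -1 if none is found. */
static int bound_cpu_by_count(cpu_mask_t mask)
{
    int n = visible_cpu_count(mask);
    for (int i = 0; i < n; i++) {
        if (cpu_in_mask(mask, i)) {
            return i;
        }
    }
    return -1;
}

/* One possible alternative (an assumption, not the proposed patch):
 * scan the full width of the mask instead of the visible count. */
static int bound_cpu_full_scan(cpu_mask_t mask)
{
    for (int i = 0; i < MAX_CPUS; i++) {
        if (cpu_in_mask(mask, i)) {
            return i;
        }
    }
    return -1;
}
```

Under these assumptions, a mask with only bit 5 set makes `bound_cpu_by_count()` return -1 (the single loop iteration tests index 0 only), whereas `bound_cpu_full_scan()` returns 5.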
> 
> The attached patch is a proposal to fix the issue.
> 
> Regards,
> Nadia
> [attachment "get_ib_dev_distance.patch" deleted by Nadia Derbey/FR/BULL]
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel