Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-08-22 09:47:09


Okay - thanks! I'll commit this to the trunk and schedule it for the 1.5 series.

On Aug 22, 2011, at 7:20 AM, pascal.deveze_at_[hidden] wrote:

>
> users-bounces_at_[hidden] wrote on 18/08/2011 14:41:25:
>
>> From: Ralph Castain <rhc_at_[hidden]>
>> To: Open MPI Users <users_at_[hidden]>
>> Date: 18/08/2011 14:45
>> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
>> Sent by: users-bounces_at_[hidden]
>>
>> Afraid I am confused. I assume this refers to the trunk, yes?
>
> I work with V1.5.
>
>>
>> I also assume you are talking about launching an application
>> directly from srun as opposed to using mpirun - yes?
>
> Yes
>
>>
>> In that case, I fail to understand what difference it makes
>> regarding this proposed change. The application process is being
>> directly bound by slurm, so what paffinity thinks is irrelevant,
>> except perhaps for some debugging I suppose. Is that what you are
>> concerned about?
>
> I have a framework that has to check whether the processes are bound.
> This framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and
> requires that all processes be bound.
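>
> As a minimal sketch (the helper name is illustrative, not my actual
> framework code, and I assume the usual calling pattern for the macro:
> the process's cpu mask plus a pointer to a bool, per the patch below):
>
>     #include "opal/constants.h"
>     #include "opal/mca/paffinity/paffinity.h"
>
>     static bool process_is_bound(void)
>     {
>         opal_paffinity_base_cpu_set_t cpuset;
>         bool bound = false;
>
>         /* Fetch this process's affinity mask, then ask the macro
>          * whether the process counts as "bound". */
>         if (OPAL_SUCCESS != opal_paffinity_base_get(&cpuset)) {
>             return false;
>         }
>         OPAL_PAFFINITY_PROCESS_IS_BOUND(cpuset, &bound);
>         return bound;
>     }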
>
> That works well, except when I use srun with slurm configured to bind
> each rank with a singleton cpuset (a cpuset containing a single core).
>
> For example, I use nodes with 8 sockets of 4 cores each. The srun command
> generates 32 cpusets (one for each core) and binds the 32 processes, one
> to each cpuset.
> The macro then returns *bound=false, so my framework considers that the
> processes are not bound and does not do its job correctly.
>
> The patch modifies the macro to return *bound=true when a single
> process is bound to a cpuset of one core.
>
>>
>> I'd just like to know what problem is actually being solved here. I
>> agree that, if there is only one processor in a system, you are
>> effectively "bound".
>>
>>
>> On Aug 18, 2011, at 2:25 AM, pascal.deveze_at_[hidden] wrote:
>>
>>> Hi all,
>>>
>>> When slurm is configured with the following parameters
>>> TaskPlugin=task/affinity
>>> TaskPluginParam=Cpusets
>>> srun binds the processes by placing them into different
>>> cpusets, each containing a single core.
>>>
>>> e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two allocated
>>> nodes and place the four ranks there, each single rank with a singleton
>>> as a cpu constraint.
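>>>
>>> (As an aside, one way to observe this binding from outside MPI is to
>>> have each task print its allowed cpus, e.g.:
>>>
>>>     srun -N 2 -n 4 grep Cpus_allowed_list /proc/self/status
>>>
>>> Each task should report a single core; the exact core numbers will of
>>> course vary with the node.)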
>>>
>>> The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
>>> (in opal/mca/paffinity/paffinity.h):
>>> . opal_paffinity_base_get_processor_info() fills in num_processors
>>>   with 1 (this is the size of each cpuset)
>>> . num_bound is set to 1 too
>>> and this implies *bound=false
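>>>
>>> In other words (a simplified sketch of the macro's final test, not the
>>> verbatim code):
>>>
>>>     /* num_processors: size of the cpuset the process runs in (here 1)
>>>      * num_bound:      cpus the process is actually bound to   (here 1) */
>>>     if (0 < num_bound && num_bound < num_processors) {
>>>         *(bound) = true;  /* 1 < 1 is false, so *bound stays false */
>>>     }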
>>>
>>> So, the binding is correctly done by slurm but not detected by Open MPI.
>>>
>>> To support the cpuset binding done by slurm, I propose the following
>>> patch:
>>>
>>> hg diff opal/mca/paffinity/paffinity.h
>>> diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
>>> --- a/opal/mca/paffinity/paffinity.h Thu Apr 21 17:38:00 2011 +0200
>>> +++ b/opal/mca/paffinity/paffinity.h Tue Jul 12 15:44:59 2011 +0200
>>> @@ -218,7 +218,8 @@
>>>                 num_bound++; \
>>>             } \
>>>         } \
>>> -       if (0 < num_bound && num_bound < num_processors) { \
>>> +       if (0 < num_bound && ((num_processors == 1) || \
>>> +                             (num_bound < num_processors))) { \
>>>             *(bound) = true; \
>>>         } \
>>>     } \
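>>>
>>> With this change, a process confined to a one-core cpuset
>>> (num_processors == 1 and num_bound == 1) is reported as bound, while
>>> the existing num_bound < num_processors test still covers the larger
>>> cpusets.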