
Subject: Re: [OMPI users] affinity issues under cpuset torque 1.8.1
From: Brock Palen (brockp_at_[hidden])
Date: 2014-06-20 12:15:56


In this case they are on a single socket, but as you can see they could be either, depending on the job.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp_at_[hidden]
(734)936-1985

On Jun 19, 2014, at 2:44 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Sorry, I should have been clearer - I was asking if cores 8-11 are all on one socket, or span multiple sockets
>
>
> On Jun 19, 2014, at 11:36 AM, Brock Palen <brockp_at_[hidden]> wrote:
>
>> Ralph,
>>
>> It was a large job spread across multiple nodes. Our system allows users to ask for 'procs', which are laid out in any format.
>>
>> The list:
>>
>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>
>> shows that nyx5406 had 2 cores, nyx5427 also had 2, and nyx5411 had 11.
>>
>> They could be spread across any number of socket configurations. We start very lax ("user requests X procs") and then the user can add stricter requirements from there. We support mostly serial users, and users can be colocated on nodes.
>>
>> That is good to know; I think we would want to change our default to 'bind to core', except for our few users who run in hybrid mode.
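>>
>> (A sketch of how we might do that - assuming the 1.8-series MCA parameter name is hwloc_base_binding_policy - would be to set it site-wide in etc/openmpi-mca-params.conf:
>>
>>    hwloc_base_binding_policy = core
>>
>> or per job with "mpirun --bind-to core ...", and let the hybrid users override it with "--bind-to socket" or "--bind-to none".)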
>>
>> Our CPU set tells you which cores the job is assigned. So in the problem case provided, the cpuset/cgroup shows that only cores 8-11 are available to this job on this node.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> brockp_at_[hidden]
>> (734)936-1985
>>
>>
>>
>> On Jun 18, 2014, at 11:10 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
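>>>
>>> (For example - just a sketch, and "./your_app" is a placeholder - the binding can be forced and verified with the standard 1.8 options:
>>>
>>>    mpirun -np 4 --bind-to socket --report-bindings ./your_app
>>>
>>> where --report-bindings prints each rank's actual binding at launch, so it can be compared against the cpuset.)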
>>>
>>> I'm not sure what your cpuset is telling us - are you binding us to a socket? Are some cpus in one socket, and some in another?
>>>
>>> It could be that the cpuset + bind-to socket is resulting in some odd behavior, but I'd need a little more info to narrow it down.
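>>>
>>> (One way to check that - assuming hwloc 1.x's object naming - would be to intersect the job's mask, the 0x00000f00 shown in the output below, with sockets instead of PUs:
>>>
>>>    hwloc-calc --intersect socket 0x00000f00
>>>
>>> which should list the socket index(es) that cores 8-11 fall on.)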
>>>
>>>
>>> On Jun 18, 2014, at 7:48 PM, Brock Palen <brockp_at_[hidden]> wrote:
>>>
>>>> I have started using 1.8.1 for some codes (meep in this case). It sometimes works fine, but in a few cases I am seeing ranks being given overlapping CPU assignments - not always, though.
>>>>
>>>> Example job, with the default binding options (so by-core, right?):
>>>>
>>>> Assigned nodes (the one in question is nyx5398); we use Torque CPU sets and TM to spawn:
>>>>
>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>
>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16065
>>>> 0x00000200
>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16066
>>>> 0x00000800
>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16067
>>>> 0x00000200
>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16068
>>>> 0x00000800
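>>>>
>>>> (For reference, decoding those two masks the same way as the 0x00000f00 mask further down -
>>>>
>>>>    hwloc-calc --intersect PU 0x00000200
>>>>    hwloc-calc --intersect PU 0x00000800
>>>>
>>>> - should report PU 9 and PU 11 respectively, i.e. two ranks share PU 9 and the other two share PU 11.)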
>>>>
>>>> [root_at_nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>>>> 8-11
>>>>
>>>> So Torque claims the CPU set for the job has 4 cores, but as you can see the ranks were given identical bindings.
>>>>
>>>> I checked the pids - they were part of the correct CPU set. I also checked orted:
>>>>
>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16064
>>>> 0x00000f00
>>>> [root_at_nyx5398 ~]# hwloc-calc --intersect PU 16064
>>>> ignored unrecognized argument 16064
>>>>
>>>> [root_at_nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
>>>> 8,9,10,11
>>>>
>>>> Which is exactly what I would expect.
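>>>>
>>>> (As an aside, since hwloc-calc takes the hex mask directly but not a pid, the two steps above can probably be combined as:
>>>>
>>>>    hwloc-calc --intersect PU $(hwloc-bind --get --pid 16064)
>>>>
>>>> which should print the same 8,9,10,11.)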
>>>>
>>>> So umm, I'm lost as to why this might happen. What else should I check? Like I said, not all jobs show this behavior.
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> CAEN Advanced Computing
>>>> XSEDE Campus Champion
>>>> brockp_at_[hidden]
>>>> (734)936-1985
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24672.php
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24673.php
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24675.php
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24676.php