It was a large job spread across multiple nodes. Our system lets users ask for 'procs', which can be laid out in any arrangement.
Shows that nyx5406 had 2 cores, nyx5427 also 2, nyx5411 had 11.
They could be spread across any socket configuration. We start out very lax ("user requests X procs"), and the user can then add stricter requirements from there. We support mostly serial users, and users can be colocated on nodes.
That is good to know. I think we would want to change our default to 'bind to core', except for our few users who run in hybrid mode.
Our CPU set tells you which cores the job is assigned. So in the problem case I provided, the cpuset/cgroup shows that only cores 8-11 are available to this job on this node.
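
For anyone who wants to cross-check the two views, here is a rough Python sketch (not part of Torque, hwloc, or Open MPI) that decodes a single-word hwloc-style hex mask such as the 0x00000f00 from my earlier mail, expands a cpuset list string such as "8-11", and reports CPUs shared by more than one process. The function names and the sample per-process masks in the usage note are hypothetical; multi-word masks (comma-separated, for machines with more than 32 PUs) are deliberately not handled.

```python
def mask_to_cpus(mask):
    """Decode a single-word hex bitmask (e.g. '0x00000f00') into CPU indices."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def cpuset_to_cpus(spec):
    """Expand a cpuset list string (e.g. '8-11' or '0-3,8') into CPU indices."""
    cpus = []
    for part in spec.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def shared_cpus(bindings):
    """Return {cpu: [labels]} for every CPU that more than one label is bound to.

    `bindings` maps an arbitrary label (a pid, a rank number) to a hex mask.
    """
    owners = {}
    for label, mask in bindings.items():
        for cpu in mask_to_cpus(mask):
            owners.setdefault(cpu, []).append(label)
    return {cpu: labels for cpu, labels in owners.items() if len(labels) > 1}
```

With this, mask_to_cpus("0x00000f00") and cpuset_to_cpus("8-11") both give cores 8 through 11, which matches what the cpuset reports. Feeding shared_cpus made-up masks such as {"a": "0x00000100", "b": "0x00000100"} flags core 8 as shared between "a" and "b", which is the kind of overlap I am seeing between ranks.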
CAEN Advanced Computing
XSEDE Campus Champion
On Jun 18, 2014, at 11:10 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
> I'm not sure what your cpuset is telling us - are you binding us to a socket? Are some cpus in one socket, and some in another?
> It could be that the cpuset + bind-to socket is resulting in some odd behavior, but I'd need a little more info to narrow it down.
> On Jun 18, 2014, at 7:48 PM, Brock Palen <brockp_at_[hidden]> wrote:
>> I have started using 1.8.1 for some codes (meep in this case) and it sometimes works fine, but in a few cases I am seeing ranks being given overlapping CPU assignments, not always though.
>> Example job, default binding options (so by-core right?):
>> Assigned nodes, the one in question is nyx5398, we use torque CPU sets, and use TM to spawn.
>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16065
>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16066
>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16067
>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16068
>> [root_at_nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>> So torque claims the CPU set set up for the job has 4 cores, but as you can see the ranks were given identical bindings.
>> I checked that the pids were part of the correct CPU set. I also checked orted:
>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16064
>> [root_at_nyx5398 ~]# hwloc-calc --intersect PU 16064
>> ignored unrecognized argument 16064
>> [root_at_nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
>> Which is exactly what I would expect.
>> So ummm, I'm lost as to why this might happen. What else should I check? Like I said, not all jobs show this behavior.
>> Brock Palen
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> users mailing list
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24672.php
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24673.php