Open MPI User's Mailing List Archives

Subject: [OMPI users] affinity issues under cpuset torque 1.8.1
From: Brock Palen (brockp_at_[hidden])
Date: 2014-06-18 22:48:24

I have started using 1.8.1 for some codes (meep in this case). It sometimes works fine, but in a few cases, though not always, I am seeing ranks being given overlapping CPU assignments.

Example job, using the default binding options (so by-core, right?):

Of the assigned nodes, the one in question is nyx5398. We use torque CPU sets, and use TM to spawn.


[root_at_nyx5398 ~]# hwloc-bind --get --pid 16065
[root_at_nyx5398 ~]# hwloc-bind --get --pid 16066
[root_at_nyx5398 ~]# hwloc-bind --get --pid 16067
[root_at_nyx5398 ~]# hwloc-bind --get --pid 16068
[root_at_nyx5398 ~]# cat /dev/cpuset/torque/

So torque claims the CPU set it set up for the job has 4 cores, but as you can see, the ranks were given identical bindings.
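
For anyone wanting to repeat this check across many PIDs, here is a minimal sketch of a helper that takes the hex masks printed by `hwloc-bind --get --pid <pid>` and reports whether any two share a CPU bit. The function name `masks_overlap` and the example masks are hypothetical, not taken from this job. It also assumes each mask is a single hex value that fits in shell arithmetic, so it does not handle hwloc's comma-separated masks on larger machines.

```shell
# Return 0 (true) if any two of the given hex masks share a set bit,
# and 1 (false) if all masks are pairwise disjoint.
# Assumption: each argument is a single hex value such as 0x00000100.
masks_overlap() {
    seen=0
    for mask in "$@"; do
        # A non-zero AND means this mask reuses a CPU already claimed.
        if [ $(( seen & ${mask} )) -ne 0 ]; then
            return 0
        fi
        seen=$(( seen | ${mask} ))
    done
    return 1
}

# Hypothetical masks: four ranks each bound to a distinct core.
if masks_overlap 0x00000100 0x00000200 0x00000400 0x00000800; then
    echo "overlap"
else
    echo "disjoint"
fi
```

With by-core binding inside a 4-core CPU set, four ranks on the node should come out disjoint; identical masks for every rank would be reported as an overlap.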

I checked the PIDs and they were part of the correct CPU set. I also checked orted:

[root_at_nyx5398 ~]# hwloc-bind --get --pid 16064
[root_at_nyx5398 ~]# hwloc-calc --intersect PU 16064
ignored unrecognized argument 16064

[root_at_nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00

Which is exactly what I would expect.

So, ummm, I'm lost as to why this might happen. What else should I check? Like I said, not all jobs show this behavior.

Brock Palen
CAEN Advanced Computing
XSEDE Campus Champion