Hardware Locality Users' Mailing List Archives

Subject: [hwloc-users] Strange binding issue on 40 core nodes and cgroups
From: Brock Palen (brockp_at_[hidden])
Date: 2012-11-02 16:03:34


This isn't a hwloc problem exactly, but maybe you can shed some insight.

We have some 4-socket, 10-core (40 cores total) nodes, HT off:

depth 0: 1 Machine (type #1)
 depth 1: 4 NUMANodes (type #2)
  depth 2: 4 Sockets (type #3)
   depth 3: 4 Caches (type #4)
    depth 4: 40 Caches (type #4)
     depth 5: 40 Caches (type #4)
      depth 6: 40 Cores (type #5)
       depth 7: 40 PUs (type #6)
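
For reference, the summary above is lstopo output. A minimal hwloc C sketch that prints the same per-depth counts (my own illustration, assuming the hwloc headers and library are installed) would be:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    int depth;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Print the object count and type at each depth of the topology. */
    for (depth = 0; depth < (int) hwloc_topology_get_depth(topo); depth++)
        printf("depth %d: %u %s\n", depth,
               hwloc_get_nbobjs_by_depth(topo, depth),
               hwloc_obj_type_string(hwloc_get_depth_type(topo, depth)));

    hwloc_topology_destroy(topo);
    return 0;
}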

We run RHEL 6.3 and use Torque to create cgroups for jobs. I get the following cpuset for this job; all 12 cores for the job are on one node:
cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
0-1,4-5,8,12,16,20,24,28,32,36

Not all nicely spaced, but 12 cores.
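
As a cross-check (my own addition, not something the job script runs), a small C program using sched_getaffinity shows which CPUs the kernel will actually let a process in this cpuset run on; it should report the same 12 CPUs:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    int cpu;

    /* Ask the kernel for this process's allowed-CPU mask,
       which is constrained by the cgroup/cpuset it lives in. */
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("pid %d allowed CPUs:", (int) getpid());
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}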

I then start a code, even a simple serial code, with Open MPI 1.6.0 on all 12 cores:
mpirun ./stream

45521 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
45522 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.08 stream
45525 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
45526 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.07 stream
45527 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
45528 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
45532 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.05 stream
45529 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
45530 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
45531 brockp 20 0 1885m 1.8g 456 R 33.6 0.2 1:20.89 stream
45523 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.90 stream
45524 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.89 stream

Note the processes that are not running at 100% CPU.

hwloc-bind --get --pid 45523
0x00000011,0x11111133
<the same mask is reported for all 12 processes>

hwloc-calc 0x00000011,0x11111133 --intersect PU
0,1,2,3,4,5,6,7,8,9,10,11

So all ranks in the job should see all 12 cores. The same cgroup is reported by /proc/<pid>/cgroup.
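
For illustration only, the same two queries can be done programmatically with the hwloc C API: hwloc_get_proc_cpubind for the mask, then an intersection with the PU objects for the logical indexes. The <pid> argument below would be one of the ranks above, e.g. 45523.

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_obj_t pu = NULL;
    char *str;
    pid_t pid;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid = (pid_t) atoi(argv[1]);

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Same query as "hwloc-bind --get --pid <pid>" */
    hwloc_get_proc_cpubind(topo, pid, set, 0);
    hwloc_bitmap_asprintf(&str, set);
    printf("binding mask: %s\nlogical PUs:", str);
    free(str);

    /* Same reduction as "hwloc-calc <mask> --intersect PU" */
    while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(topo, set,
                                                          HWLOC_OBJ_PU, pu)))
        printf(" %u", pu->logical_index);
    printf("\n");

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}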

Not only that, I can make things work by forcing binding in the MPI launcher:
mpirun -bind-to-core ./stream

46886 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46887 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46888 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46889 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46890 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46891 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46892 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46893 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46894 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46895 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46896 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream
46897 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream

Things are now working as expected, and I should stress that this is inside the same Torque job and cgroup that I started with.
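
For comparison, what -bind-to-core effectively arranges is a per-rank binding to a single core. A rough hwloc sketch of that (my own illustration; the OMPI_COMM_WORLD_LOCAL_RANK variable is assumed to be exported by the Open MPI launcher) could look like:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t core;
    hwloc_cpuset_t set;
    unsigned nbcores, rank;
    /* Assumed: Open MPI exports the node-local rank in this variable. */
    const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");

    rank = env ? (unsigned) atoi(env) : 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* By default hwloc only shows the cores allowed by the current cpuset,
       so this counts the job's cores, not the whole node. */
    nbcores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    if (nbcores == 0)
        return 1;

    /* Pick the rank-th core and bind the whole process to one of its PUs. */
    core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, rank % nbcores);
    set = hwloc_bitmap_dup(core->cpuset);
    hwloc_bitmap_singlify(set);
    if (hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) != 0)
        perror("hwloc_set_cpubind");

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    /* ... the rest of the program then runs pinned to that core ... */
    return 0;
}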

A multi-threaded version of the code does use close to 12 cores, as expected.

If I circumvent our batch system and the cgroups, a plain mpirun ./stream does start 12 processes that each consume a full core at 100%.

Thoughts? This is really odd Linux scheduler behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp_at_[hidden]
(734)936-1985