This isn't exactly a hwloc problem, but maybe you can shed some light on it.
We have some nodes with 4 sockets of 10 cores each (40 cores per node), HT off:
depth 0: 1 Machine (type #1)
depth 1: 4 NUMANodes (type #2)
depth 2: 4 Sockets (type #3)
depth 3: 4 Caches (type #4)
depth 4: 40 Caches (type #4)
depth 5: 40 Caches (type #4)
depth 6: 40 Cores (type #5)
depth 7: 40 PUs (type #6)
We run RHEL 6.3 and use Torque to create cgroups for jobs. I get the following cgroup for this job; all 12 cores for the job are on one node:
cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
0-1,4-5,8,12,16,20,24,28,32,36
Not nicely spaced, but it is 12 cores.
I then start a code, even a simple serial code, with Open MPI 1.6.0 on all 12 cores:
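As a quick sanity check on that count, here is a minimal Python sketch that expands a cpuset list string like the one above. The function name parse_cpu_list is made up for illustration, and it only assumes the usual comma-separated format of single IDs and inclusive ranges.

```python
def parse_cpu_list(text):
    """Expand a cpuset-style list such as "0-1,4-5,8" into sorted CPU IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty entries, e.g. from a trailing comma
        if "-" in part:
            # "a-b" is an inclusive range of CPU IDs
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


cpus = parse_cpu_list("0-1,4-5,8,12,16,20,24,28,32,36")
print(len(cpus), cpus)
```

For the list above this gives 12 distinct CPU IDs.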
mpirun ./stream
45521 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
45522 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.08 stream
45525 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
45526 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.07 stream
45527 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
45528 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
45532 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.05 stream
45529 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
45530 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
45531 brockp 20 0 1885m 1.8g 456 R 33.6 0.2 1:20.89 stream
45523 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.90 stream
45524 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.89 stream
Note the processes that are not running at 100% CPU. The binding of one of them:
hwloc-bind --get --pid 45523
0x00000011,0x11111133
<the same mask is reported for all 12 processes>
hwloc-calc 0x00000011,0x11111133 --intersect PU
0,1,2,3,4,5,6,7,8,9,10,11
So every rank in the job should be able to run on all 12 cores. The same cgroup is reported by /proc/<pid>/cgroup.
Not only that, I can make things work by forcing binding in the MPI launcher:
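The mask from hwloc-bind and the cpuset list use different notations, so it can help to decode the mask by hand. The sketch below assumes the hwloc bitmap string format of comma-separated 32-bit hex words with the most significant word first, and it does not handle infinite sets. The function name hwloc_mask_to_cpus is hypothetical, not part of hwloc.

```python
def hwloc_mask_to_cpus(mask):
    """Decode a finite hwloc bitmap string into a sorted list of set bit indexes."""
    value = 0
    for word in mask.split(","):
        # Each word is 32 bits wide; earlier words are more significant.
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(hwloc_mask_to_cpus("0x00000011,0x11111133"))
```

For the mask above this yields bits 0, 1, 4, 5, 8, 12, 16, 20, 24, 28, 32 and 36, which is the same set as the cpuset list. The 0 through 11 printed by hwloc-calc are presumably logical PU indexes within the restricted topology rather than OS CPU numbers.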
mpirun -bind-to-core ./stream
46886 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46887 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46888 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46889 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46890 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46891 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46892 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46893 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46894 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46895 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46896 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream
46897 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream
Things are now working as expected, and I should stress that this is inside the same Torque job and cgroup that I started with.
A multithreaded version of the code does use close to 12 cores, as expected.
If I circumvent our batch system and the cgroups, a normal mpirun ./stream does start 12 processes that each consume a full core (100%).
Thoughts? This is really odd Linux scheduler behavior.
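One way to see where the unbound ranks actually land is to look at the CPU each process last ran on. The following is only a sketch under stated assumptions: it reads field 39 ("processor") of /proc/<pid>/stat as described in proc(5), the helper names are invented for illustration, and the PIDs have to be supplied on the command line.

```python
import sys
from collections import defaultdict


def last_cpu_from_stat(stat_line):
    """Return the CPU a task last ran on, from one line of /proc/<pid>/stat."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split after the last ')'. What follows starts at field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[39 - 3])  # field 39 is "processor"


def last_cpu(pid):
    """Read /proc/<pid>/stat for a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as handle:
        return last_cpu_from_stat(handle.read())


if __name__ == "__main__":
    by_cpu = defaultdict(list)
    for arg in sys.argv[1:]:
        by_cpu[last_cpu(int(arg))].append(int(arg))
    for cpu in sorted(by_cpu):
        print("cpu %d: pids %s" % (cpu, by_cpu[cpu]))
```

Run with the PIDs of the ranks as arguments, it groups them by CPU. Several PIDs listed under one CPU would mean those ranks are sharing a core. The value is a snapshot, so it is worth sampling a few times.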
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp_at_[hidden]
(734)936-1985