On 03/10/2013 02:56, Panos Labropoulos wrote:
Hello,


I initially posted this at users@open-mpi.org.

We seem to be unable to set the CPU binding on a cluster consisting of Dell M420/M610 systems:

[jallan@hpc21 ~]$ cat report-bindings.sh
#!/bin/sh

bitmap=`hwloc-bind --get -p`
friendly=`hwloc-calc -p -H socket.core.pu $bitmap`

echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
exit 0


[jallan@hpc27 ~]$ hwloc-bind -v  socket:0.core:0 -l ./report-bindings.sh
using object #0 depth 2 below cpuset 0x000000ff
using object #0 depth 6 below cpuset 0x00000080
adding 0x00000080 to 0x0
adding 0x00000080 to 0x0
assuming the command starts at ./report-bindings.sh
binding on cpu set 0x00000080
MCW rank  (hpc27): Socket:0.Core:10.PU:7
[jallan@hpc27 ~]$ hwloc-bind -v  socket:1.core:0 -l ./report-bindings.sh
object #1 depth 2 (type socket) below cpuset 0x000000ff does not exist
adding 0x0 to 0x0
assuming the command starts at ./report-bindings.sh
MCW rank  (hpc27): Socket:0.Core:10.PU:7


The topology of this system looks a bit strange:

[jallan@hpc21 ~]$ lstopo --no-io
Machine (24GB)
 NUMANode L#0 (P#0 24GB)
 NUMANode L#1 (P#1) + Socket L#0 + L3 L#0 (15MB) + L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#11)
[jallan@hpc21 ~]$


You likely have a Linux cpuset that restricts the available CPUs. That's why the first socket object doesn't appear in the lstopo output above, and why "socket:1" fails in other commands: there is no socket with logical index 1 in the restricted topology.
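
A quick way to check for such a restriction, as a minimal sketch assuming Linux with /proc and /sys mounted: compare the CPUs the current process is allowed to run on with the CPUs that are online. The helper name is_restricted is hypothetical, not part of hwloc. Note that the allowed list also reflects affinity set by other means (taskset, a previous hwloc-bind), so a difference is only a hint that a cpuset is involved.

```shell
#!/bin/sh
# Sketch only: is_restricted is a hypothetical helper, not an hwloc command.
# It succeeds (exit 0) when the allowed CPU list differs from the online list.
is_restricted() {
    [ "$1" != "$2" ]
}

# CPUs this process may run on (inherited from the calling shell).
allowed=$(sed -n 's/^Cpus_allowed_list:[[:space:]]*//p' /proc/self/status 2>/dev/null)
# CPUs the kernel currently has online.
online=$(cat /sys/devices/system/cpu/online 2>/dev/null)

if is_restricted "$allowed" "$online"; then
    echo "restricted: allowed=$allowed online=$online"
else
    echo "unrestricted: allowed=$allowed"
fi
```

If the two lists differ inside a batch job but match in a login shell on the same node, the scheduler is the likely source of the restriction.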

If you're allocating jobs with a batch scheduler, the problem will go away if you reserve all cores of the node instead of a single one.

If you really want to experiment with manual binding on that restricted platform, you also have to account for the unavailable resources yourself, since object indexes only cover what the restricted topology exposes.

Otherwise you can generate the entire topology with "lstopo --whole-system foo.xml" and then use it with "normal" socket numbers: "hwloc-bind -i foo.xml socket:1.core:0 etc". You won't get errors about missing objects anymore, but you may get new errors about binding failures if you try to bind to objects outside the restricted topology.

Brice