Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] Understanding hwloc-ps output
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-05-30 10:59:43


Short version:
==============

OMPI 1.6.soon-to-be-1 will report *logical* hwloc core bitmasks (not PUs!). The reasons for this are sordid and, frankly, uninteresting. :-\

Perhaps we need to update this to be something a bit more user-friendly before 1.6.1 goes final. Hrm...

More detail:
============

Here's the lstopo of a node of mine that has hyperthreading enabled:

-----
# Logical lstopo output
% lstopo --no-io
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#10)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#12)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#14)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#1)
      PU L#9 (P#9)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#3)
      PU L#11 (P#11)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#5)
      PU L#13 (P#13)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#15)

# Note the physical lstopo output -- my sockets are physically ordered
# "backwards" for some weird reason. Shrug. This is important to note
# for the example below, however.

% lstopo -p --no-io
Machine (24GB)
  NUMANode P#0 (12GB) + Socket P#1 + L3 (8192KB)
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#0
      PU P#0
      PU P#8
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#1
      PU P#2
      PU P#10
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#2
      PU P#4
      PU P#12
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#3
      PU P#6
      PU P#14
  NUMANode P#1 (12GB) + Socket P#0 + L3 (8192KB)
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#0
      PU P#1
      PU P#9
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#1
      PU P#3
      PU P#11
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#2
      PU P#5
      PU P#13
    L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#3
      PU P#7
      PU P#15
-----
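
As a quick cross-check of the logical-vs-physical numbering (my own hand-run sketch, not part of the runs below; it assumes hwloc-calc's default output, which is a bitmask of physical PU indexes), logical core 0 covers physical PUs 0 and 8, i.e. the first core of *physical* socket 1:

-----
# Default hwloc-calc output is a physical PU bitmask; bits 0 and 8 are set.
% hwloc-calc core:0
0x00000101
-----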

Here's an mpirun using --bind-to-core --bycore with the new 1.6.soon-to-be-1 stuff (not yet committed to the v1.6 SVN branch):

-----
% cat report-bindings.sh
#!/bin/sh

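# Grab this process's current binding as a bitmask of physical PU indexes,
# then have hwloc-calc render that mask hierarchically with physical
# socket/core/PU indexes.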
bitmap=`hwloc-bind --get -p`
friendly=`hwloc-calc -p -H socket.core.pu $bitmap`

echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
exit 0
% mpirun --mca btl tcp,sm,self --report-bindings --host svbu-mpi056 --np 2 --bind-to-core --bycore ./report-bindings.sh
[svbu-mpi056:23643] [[11178,0],1] odls:default:fork binding child [[11178,1],0] to cpus 0001
[svbu-mpi056:23643] [[11178,0],1] odls:default:fork binding child [[11178,1],1] to cpus 0002
MCW rank 0 (svbu-mpi056): Socket:1.Core:0.PU:0 Socket:1.Core:0.PU:8
MCW rank 1 (svbu-mpi056): Socket:1.Core:1.PU:2 Socket:1.Core:1.PU:10
%
-----

Specifically: OMPI 1.6.soon-to-be-1 is binding MCW rank 0 to both PUs in physical socket:1.core:0, and MCW rank 1 to both PUs in physical socket:1.core:1. Remember: my sockets are physically ordered opposite of their logical ordering; see the lstopo -p output above.
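
If you want to do that translation by hand (a hedged sketch of mine, mirroring what report-bindings.sh does; the inner hwloc-calc just produces the physical PU mask of hwloc logical core 0, which is what bit 0 of OMPI's "cpus 0001" means):

-----
% hwloc-calc -p -H socket.core.pu `hwloc-calc core:0`
Socket:1.Core:0.PU:0 Socket:1.Core:0.PU:8
-----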

I don't remember offhand what kind of bitmask OMPI 1.4.x outputs. I'd be kinda surprised if it binds to core X on socket A and core Y on socket B (where A != B), though...

The more I think about this, the more I think that if the OMPI 1.6 series is going to have a decent shelf life (as the new stable series), we should make --report-bindings output something more user-friendly. I'll work on that.

As a stopgap until I get that done, note that you can configure OMPI with --enable-mpi-ext=affinity to enable the OMPI_Affinity_str() function. See the OMPI_Affinity_str(3) man page for details.
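
For example, a minimal configure sketch (the prefix and -j value are just placeholders; --enable-mpi-ext=affinity is the only part that matters here):

-----
% ./configure --prefix=$HOME/ompi-install --enable-mpi-ext=affinity
% make -j 4 all install
-----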

We have an example OMPI_Affinity_str() program in OMPI 1.7 that you can compile with OMPI 1.6; it pretty-prints the current bindings. For example:

-----
% wget --no-check-certificate http://svn.open-mpi.org/svn/ompi/trunk/ompi/mpiext/affinity/c/example.c
[...snip wget output...]
% mpicc example.c -o example
% mpirun --mca btl tcp,sm,self --report-bindings --host svbu-mpi056 --np 2 --bind-to-core --bycore ./example
[svbu-mpi056:24312] [[19149,0],1] odls:default:fork binding child [[19149,1],0] to cpus 0001
[svbu-mpi056:24312] [[19149,0],1] odls:default:fork binding child [[19149,1],1] to cpus 0002
rank 0 (resource string):
       ompi_bound: socket 0[core 0]
  current_binding: socket 0[core 0]
           exists: socket 0 has 4 cores, socket 1 has 4 cores
rank 0 (layout):
       ompi_bound: [B . . .][. . . .]
  current_binding: [B . . .][. . . .]
           exists: [. . . .][. . . .]
rank 1 (resource string):
       ompi_bound: socket 0[core 1]
  current_binding: socket 0[core 1]
           exists: socket 0 has 4 cores, socket 1 has 4 cores
rank 1 (layout):
       ompi_bound: [. B . .][. . . .]
  current_binding: [. B . .][. . . .]
           exists: [. . . .][. . . .]
%
-----

Note, too, that OMPI 1.6 only lets you bind to sockets and cores, which is why the above output doesn't show hyperthreads (even though they are there, according to the lstopo output).
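
(For completeness, and hedged since I'm not pasting a run here: the socket-level equivalent with the 1.6-series options would be something like the following; output omitted.)

-----
% mpirun --mca btl tcp,sm,self --report-bindings --host svbu-mpi056 --np 2 --bind-to-socket --bysocket ./example
-----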

That being said, we have completely revamped process/processor affinity support in what will become OMPI v1.7 (i.e., the current OMPI SVN trunk). For example, OMPI 1.7 will let you bind to hyperthreads (and caches, and other objects). If you run the same example OMPI_Affinity_str() program with what will become OMPI v1.7, the output is a little more expressive -- it shows the hyperthreads:

-----
% cd <my OMPI SVN trunk checkout>
% cd ompi/mpiext/affinity/c
% mpicc example.c -o example
% mpirun --mca btl tcp,sm,self --report-bindings --host svbu-mpi056 --np 2 --bind-to-core ./example
[svbu-mpi056:25041] [[23016,0],1] odls:default binding child [[23016,1],0] to cpus 0,8
[svbu-mpi056:25041] [[23016,0],1] odls:default binding child [[23016,1],1] to cpus 2,10
[svbu-mpi056:25042] [[23016,1],0] is bound to cpus 0,8
[svbu-mpi056:25043] [[23016,1],1] is bound to cpus 2,10
rank 0 (resource string):
       ompi_bound: socket 1[core 0[hwt 0-1]]
  current_binding: socket 1[core 0[hwt 0-1]]
           exists: socket 1 has 4 cores, each with 2 hwts; socket 0 has 4 cores, each with 2 hwts
rank 0 (layout):
       ompi_bound: [BB/../../..][../../../..]
  current_binding: [BB/../../..][../../../..]
           exists: [../../../..][../../../..]
rank 1 (resource string):
       ompi_bound: socket 1[core 1[hwt 0-1]]
  current_binding: socket 1[core 1[hwt 0-1]]
           exists: socket 1 has 4 cores, each with 2 hwts; socket 0 has 4 cores, each with 2 hwts
rank 1 (layout):
       ompi_bound: [../BB/../..][../../../..]
  current_binding: [../BB/../..][../../../..]
           exists: [../../../..][../../../..]
%
-----
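
As an aside, the trunk / v1.7 series also grows a more general "--bind-to <object>" syntax; a hedged sketch (the exact spelling may differ in the trunk snapshot used above) of binding each process to a single hyperthread would be:

-----
% mpirun --mca btl tcp,sm,self --report-bindings --host svbu-mpi056 --np 2 --bind-to hwthread ./example
-----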

I notice the --report-bindings output is a bit different in 1.7 vs. 1.6. We should clarify this stuff, make it user-friendly, and make it the same (as much as possible) between 1.6.x and 1.7.x. I'll work on that.

On May 30, 2012, at 10:06 AM, Brice Goglin wrote:

> Jeff,
> What is the displayed bitmask in OMPI 1.6? Is it the hwloc bitmask? Or
> the OMPI bitmask made of OMPI indexes?
> Brice
>
>
>
> On 30/05/2012 16:01, Jeff Squyres wrote:
>> You might want to try the OMPI tarball that is about to become OMPI v1.6.1 -- we made a bunch of affinity-related fixes, and it should be much more predictable / stable in what it does in terms of process binding:
>>
>> http://www.open-mpi.org/~jsquyres/unofficial/
>>
>> (these affinity fixes are not yet in a nightly 1.6 tarball because we're testing them before they get committed to the OMPI v1.6 SVN branch)
>>
>>
>> On May 30, 2012, at 9:54 AM, Brice Goglin wrote:
>>
>>> Hello Youri,
>>> When using openmpi 1.4.4 with "--np 2 --bind-to-core --bycore", it reports the following:
>>>> [hostname:03339] [[17125,0],0] odls:default:fork binding child [[17125,1],0] to cpus 0001
>>>>
>>>> [hostname:03339] [[17125,0],0] odls:default:fork binding child [[17125,1],1] to cpus 0002
>>>>
>>> Bitmasks 0001 and 0002 mean CPUs with physical indexes 0 and 1 in OMPI 1.4, so that corresponds to the first core of each socket, and it matches what hwloc-ps says. Running "hwloc-ps -c" should show the same bitmasks.
>>>
>>> However, I agree that these are not adjacent cores, but I don't know enough about OMPI's binding options to understand what it was supposed to do in your case.
>>>
>>> Brice

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/