Hardware Locality Users' Mailing List Archives

Subject: [hwloc-users] Solaris and hwloc
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-12 10:16:14


Brice / Samuel --

How well does hwloc work for process binding on Solaris? This is not something I've followed closely (note that Terry Dontje has moved on to other projects inside Oracle, so he's no longer my go-to guy for All Things Solaris...).

Siegmar Gross (CC'ed) originally had a binding problem in Open MPI, but we've narrowed it down to some simple binding tests with hwloc, just to avoid all the OMPI complications.

I've asked him to run hwloc-bind on a few different configurations, and run my report-bindings.sh script (see below) so that it reports where it was actually bound. He seems to get an hwloc error any time he tries to bind to more than 1 PU. Is that expected on Solaris?
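
For reference, the two masks in the failing hwloc_set_cpubind calls quoted below, 0x00000003 and 0x0000c000, each have two bits set, i.e. two PUs. A small sketch for counting the PUs in a mask, assuming the mask fits in the shell's integer arithmetic (popcount is just an illustrative helper name, not part of hwloc):

```shell
#!/bin/sh
# popcount MASK: print the number of bits set in a hex mask such as 0x0000c000.
# Hypothetical helper for illustration only; it handles single-word masks only.
popcount() {
    n=$(($1))          # let the shell parse the 0x... constant
    count=0
    while [ "$n" -ne 0 ]; do
        count=$((count + (n & 1)))   # add the lowest bit
        n=$((n >> 1))                # drop it and continue
    done
    echo "$count"
}
```

With this, `popcount 0x00000003` and `popcount 0x0000c000` both print 2.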

Sidenote: if hwloc-bind fails to bind, should we still launch the child process?
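
Until that is settled, one way to get "don't launch if the binding didn't take" without changing hwloc-bind is to check the binding from inside the child before exec'ing the real program. This is a minimal sketch under two assumptions: that hwloc-calc prints the mask for a location in the same textual form that hwloc-bind --get uses for the current binding, and that hwloc-bind accepts "--" before the command. guarded_launch and the HWLOC_BIND / HWLOC_CALC override variables are names made up for this example:

```shell
#!/bin/sh
# Hypothetical override hooks so the tools can be swapped out; not hwloc features.
HWLOC_BIND=${HWLOC_BIND:-hwloc-bind}
HWLOC_CALC=${HWLOC_CALC:-hwloc-calc}

# guarded_launch LOCATION COMMAND [ARGS...]
# Run COMMAND under hwloc-bind, but only exec it if the binding reported
# inside the child equals the mask computed for LOCATION.
guarded_launch() {
    location=$1
    shift
    expected=`"$HWLOC_CALC" "$location"` || return 2
    "$HWLOC_BIND" "$location" -- sh -c '
        expected=$1
        bind=$2
        shift 2
        actual=`"$bind" --get`
        if [ "$actual" != "$expected" ]; then
            echo "binding mismatch: wanted $expected, got $actual" >&2
            exit 1
        fi
        exec "$@"
    ' sh "$expected" "$HWLOC_BIND" "$@"
}
```

For example, `guarded_launch socket:0.core:0 ./report-bindings.sh` would run the script only when the reported binding matches the requested mask. The comparison is a plain string match, so it relies on both tools formatting masks identically.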

Here's my trivial report-bindings.sh script:

-----
#!/bin/sh

bitmap=`hwloc-bind --get -p`
friendly=`hwloc-calc -p -H socket.core.pu $bitmap`

echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
exit 0
------

See Siegmar's detailed reply, below.

On Sep 11, 2012, at 8:22 AM, Siegmar Gross wrote:

> Hi,
>
> I have purged the old stuff in the mail.
>
>> It's concerning that you cannot bind to a full core (i.e., all
>> the pu in a core). Does Solaris not allow you to bind to multiple
>> pu's in a single process?
>
> Unfortunately I don't know because I haven't used it up to now.
> "mpstat" sees all hardware threads as cpu's.
>
> rs0 fd1026 104 mpstat
> CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
>   0    1   0   16  224    8  36    0    0    1   0   127   0   0  0 100
>   1    1   0   38   69   40  38    0    0    1   0   146   0   0  0 100
>   2    2   0   18   57   28  41    0    0    1   0   169   0   0  0 100
>   3    1   0   14   40   11  40    0    0    1   0   152   0   0  0 100
>   4    1   0   13   41   11  41    0    0    1   0   149   0   0  0 100
>   5    1   0   17   43   12  42    0    0    1   0   178   0   0  0 100
>   6    2   0   15   43   11  44    0    0    1   0   171   0   0  0 100
>   7    1   0   14   42   11  41    0    0    1   0   156   0   0  0 100
>   8    1   0   10   34    9  32    0    0    0   0    46   0   0  0 100
>   9    1   0   11   34    9  32    0    0    1   0    82   0   0  0 100
>  10    1   0   10   32    8  30    0    0    1   0    55   0   0  0 100
>  11    0   0   10   31    8  29    0    0    0   0    51   0   0  0 100
>  12    0   0    9   30    8  28    0    0    0   0    46   0   0  0 100
>  13    1   0   11   29    7  27    0    0    0   0    59   0   0  0 100
>  14    1   0   11   33    8  29    0    0    1   0    68   0   0  0 100
>  15    0   0   11   29    7  26    0    0    0   0    48   0   0  0 100
>
>
> I found the following pages, which state that it is possible
> to bind a process to a processor set.
>
> http://developers.sun.com/solaris/articles/solaris_processor.html
> http://stackoverflow.com/questions/10277221/binding-process-to-multiple-processors-on-sun-solaris-os
>
>
>> Please repeat the hwloc-bind tests for both 1.3 and 1.5, but run
>> the report bindings script instead of date. That will show where
>> the child process was actually bound.
>
>
> ssh rs0
> cd hwloc
> set path = ( `pwd`/hwloc-1.3.2/bin $path )
> setenv LD_LIBRARY_PATH_32 `pwd`/hwloc-1.3.2/lib:${LD_LIBRARY_PATH_32}
>
>
> I always get "errno 18 Cross-device link" if I use
> "socket:*.core:*". There is no difference between "-l" and "-p". I
> don't see differences in the output either, but I can provide the
> output for all 16 hardware threads with both "-l" and "-p"
> if you need it.
>
> rs0 hwloc 107 which hwloc-bind
> /home/fd1026/hwloc/hwloc-1.3.2/bin/hwloc-bind
>
> rs0 hwloc 108 hwloc-bind socket:0.core:0 -l report-bindings.sh
> hwloc_set_cpubind 0x00000003 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 114 hwloc-bind socket:0.core:0 -p report-bindings.sh
> hwloc_set_cpubind 0x00000003 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 118 hwloc-bind socket:1.core:3 -l report-bindings.sh
> hwloc_set_cpubind 0x0000c000 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 119 hwloc-bind socket:1.core:3 -p report-bindings.sh
> hwloc_set_cpubind 0x0000c000 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
>
> I get no error if I use "pu:*", but I don't see a difference in the
> output. To me the output always looks the same, regardless of
> "pu:0", ..., "pu:15".
>
> rs0 hwloc 120 hwloc-bind pu:0 -l report-bindings.sh
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 121 hwloc-bind pu:0 -p report-bindings.sh
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
>
> Now the same things for hwloc-1.5:
>
> rs0 hwloc 106 which hwloc-bind
> /usr/local/bin/hwloc-bind
>
> rs0 hwloc 107 hwloc-bind socket:0.core:0 -l report-bindings.sh
> hwloc_set_cpubind 0x00000003 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 108 hwloc-bind socket:0.core:0 -p report-bindings.sh
> hwloc_set_cpubind 0x00000003 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 109 hwloc-bind socket:1.core:3 -l report-bindings.sh
> hwloc_set_cpubind 0x0000c000 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 110 hwloc-bind socket:1.core:3 -p report-bindings.sh
> hwloc_set_cpubind 0x0000c000 failed (errno 18 Cross-device link)
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
>
> rs0 hwloc 112 hwloc-bind pu:0 -l report-bindings.sh
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> rs0 hwloc 113 hwloc-bind pu:0 -p report-bindings.sh
> MCW rank (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 Socket:1024.Core:0.PU:1
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
>
> Is the above output helpful? Thank you very much for your help in advance.
> Do you know a C++ application which I can try to test our compiler?
>
>
> Kind regards
>
> Siegmar
>
>
> ##########################################################################
> # #
> # Hochschule Fulda University of Applied Sciences #
> # FB Angewandte Informatik Department of Applied Computer Science #
> # #
> # Prof. Dr. Siegmar Gross Tel.: +49 (0)661 9640 - 333 #
> # Fax: +49 (0)661 9640 - 349 #
> # Marquardstr. 35 WWW: http://www.hs-fulda.de/~gross #
> # E-Mail: Siegmar.Gross_at_[hidden] #
> # D-36039 Fulda #
> # #
> ##########################################################################
>
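
The per-PU runs in the reply above could be scripted rather than typed by hand. This is a rough sketch, assuming 16 PUs as in the mpstat output and that hwloc-bind accepts "--" before the command. sweep_pus and the HWLOC_BIND override variable are illustrative names only:

```shell
#!/bin/sh
# Hypothetical override hook so the tool can be swapped out; not an hwloc feature.
HWLOC_BIND=${HWLOC_BIND:-hwloc-bind}

# sweep_pus COUNT: bind to pu:0 .. pu:COUNT-1 in turn and print the
# binding that the bound child process reports for each one.
sweep_pus() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        mask=`"$HWLOC_BIND" "pu:$i" -- "$HWLOC_BIND" --get`
        echo "pu:$i -> $mask"
        i=$((i + 1))
    done
}
```

Running `sweep_pus 16` prints one line per PU. If single-PU binding works, each line should show a different one-bit mask. The same full mask on every line would suggest the binding is not taking effect.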

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/