Hardware Locality Users' Mailing List Archives

Subject: [hwloc-users] hwloc errors on program startup
From: Doug Roberts (roberpj_at_[hidden])
Date: 2014-01-17 11:11:02


1) We are getting hwloc topology errors when programs start up on
some new compute nodes recently added to our cluster ...

[roberpj_at_bro127:~/samples/mpi_test]
/opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self --host bro127,bro127 ./a.out
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object intersection without inclusion!
* Error occurred in topology.c line 594
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
I am process 1 on node bro127
P1: Waiting to receive from to P0
P0: Waiting to receive from P1
P0: Received from to P1
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
Run 3 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
P0: Done
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Done

2) I've run hwloc-gather-topology.sh and attached bro127.tar.bz2 ...

[roberpj_at_bro127:~/samples/hwloc-gather-topology]
/home/roberpj/builds/hwloc/1.7.2/1.7.2-debug/bin/hwloc-gather-topology $(uname -n)
Hierarchy gathered in ./bro127.tar.bz2 and kept in /tmp/tmp.Fr37QhvDGD/bro127/
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object (Socket P#0 cpuset 0x000000ff) intersection without inclusion!
* Error occurred in topology.c line 718
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Expected topology output stored in ./bro127.output

[roberpj_at_bro127:~/samples/hwloc-gather-topology] cat bro127.output
Machine (P#0 total=67106040KB DMIProductName=empty DMIProductVersion=empty
DMIBoardVendor="TYAN Computer Corporation" DMIBoardName=YR190-B8238
DMIBoardVersion=empty DMIBoardAssetTag=empty DMIChassisVendor=empty
DMIChassisType=3 DMIChassisVersion=empty DMIChassisAssetTag=empty
DMIBIOSVendor="American Megatrends Inc." DMIBIOSVersion='V1.01.B10'
DMIBIOSDate=09/26/2011 DMISysVendor=empty Backend=Linux LinuxCgroup=/)
    NUMANode L#0 (P#0 local=33551608KB total=33551608KB)
      L3Cache L#0 (size=6144KB linesize=64 ways=64)
        L2Cache L#0 (size=2048KB linesize=64 ways=16)
          L1iCache L#0 (size=64KB linesize=64 ways=2)
            L1dCache L#0 (size=16KB linesize=64 ways=4)
              Core L#0 (P#0)
                PU L#0 (P#0)
            L1dCache L#1 (size=16KB linesize=64 ways=4)
              Core L#1 (P#1)
                PU L#1 (P#1)
        L2Cache L#1 (size=2048KB linesize=64 ways=16)
          L1iCache L#1 (size=64KB linesize=64 ways=2)
            L1dCache L#2 (size=16KB linesize=64 ways=4)
              Core L#2 (P#2)
                PU L#2 (P#2)
            L1dCache L#3 (size=16KB linesize=64 ways=4)
              Core L#3 (P#3)
                PU L#3 (P#3)
      L3Cache L#1 (size=6144KB linesize=64 ways=64)
        L2Cache L#2 (size=2048KB linesize=64 ways=16)
          L1iCache L#2 (size=64KB linesize=64 ways=2)
            L1dCache L#4 (size=16KB linesize=64 ways=4)
              Core L#4 (P#0)
                PU L#4 (P#8)
            L1dCache L#5 (size=16KB linesize=64 ways=4)
              Core L#5 (P#1)
                PU L#5 (P#9)
        L2Cache L#3 (size=2048KB linesize=64 ways=16)
          L1iCache L#3 (size=64KB linesize=64 ways=2)
            L1dCache L#6 (size=16KB linesize=64 ways=4)
              Core L#6 (P#2)
                PU L#6 (P#10)
            L1dCache L#7 (size=16KB linesize=64 ways=4)
              Core L#7 (P#3)
                PU L#7 (P#11)
    NUMANode L#1 (P#1 local=33554432KB total=33554432KB)
      L3Cache L#2 (size=6144KB linesize=64 ways=64)
        L2Cache L#4 (size=2048KB linesize=64 ways=16)
          L1iCache L#4 (size=64KB linesize=64 ways=2)
            L1dCache L#8 (size=16KB linesize=64 ways=4)
              Core L#8 (P#0)
                PU L#8 (P#4)
            L1dCache L#9 (size=16KB linesize=64 ways=4)
              Core L#9 (P#1)
                PU L#9 (P#5)
        L2Cache L#5 (size=2048KB linesize=64 ways=16)
          L1iCache L#5 (size=64KB linesize=64 ways=2)
            L1dCache L#10 (size=16KB linesize=64 ways=4)
              Core L#10 (P#2)
                PU L#10 (P#6)
            L1dCache L#11 (size=16KB linesize=64 ways=4)
              Core L#11 (P#3)
                PU L#11 (P#7)
      L3Cache L#3 (size=6144KB linesize=64 ways=64)
        L2Cache L#6 (size=2048KB linesize=64 ways=16)
          L1iCache L#6 (size=64KB linesize=64 ways=2)
            L1dCache L#12 (size=16KB linesize=64 ways=4)
              Core L#12 (P#0)
                PU L#12 (P#12)
            L1dCache L#13 (size=16KB linesize=64 ways=4)
              Core L#13 (P#1)
                PU L#13 (P#13)
        L2Cache L#7 (size=2048KB linesize=64 ways=16)
          L1iCache L#7 (size=64KB linesize=64 ways=2)
            L1dCache L#14 (size=16KB linesize=64 ways=4)
              Core L#14 (P#2)
                PU L#14 (P#14)
            L1dCache L#15 (size=16KB linesize=64 ways=4)
              Core L#15 (P#3)
                PU L#15 (P#15)
depth 0: 1 Machine (type #1)
   depth 1: 2 NUMANode (type #2)
    depth 2: 4 L3Cache (type #4)
     depth 3: 8 L2Cache (type #4)
      depth 4: 8 L1iCache (type #4)
       depth 5: 16 L1dCache (type #4)
        depth 6: 16 Core (type #5)
        depth 7: 16 PU (type #6)
latency matrix between NUMANodes (depth 1) by logical indexes:
    index 0 1
        0 1.000 1.600
        1 1.600 1.000
Topology not from this system
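
Note on the error in 2): "intersection without inclusion" means that two objects' cpusets overlap while neither contains the other, so they cannot be nested in a tree. The sketch below is one plausible reading of the message, not hwloc's actual code: it uses plain Python integers as bitmasks, takes the Socket P#0 cpuset from the error text and the NUMA node PU lists from the listing above, and the helper names `mask` and `relation` are made up for illustration.

```python
def mask(pus):
    """Build an integer bitmask with one bit set per physical PU index."""
    m = 0
    for pu in pus:
        m |= 1 << pu
    return m


def relation(a, b):
    """Classify how two cpusets (integer bitmasks) relate to each other."""
    common = a & b
    if common == 0:
        return "disjoint"
    if common == a and common == b:
        return "equal"
    if common == b:
        return "a includes b"
    if common == a:
        return "b includes a"
    return "intersection without inclusion"


# Socket P#0 cpuset exactly as printed in the error message.
socket0 = 0x000000ff
# NUMANode P#0 holds PUs P#0-3 and P#8-11 in the listing above.
numa0 = mask([0, 1, 2, 3, 8, 9, 10, 11])
# NUMANode P#1 holds PUs P#4-7 and P#12-15 in the listing above.
numa1 = mask([4, 5, 6, 7, 12, 13, 14, 15])

print(hex(numa0), relation(socket0, numa0))
print(hex(numa1), relation(socket0, numa1))
```

Under these assumptions the socket mask shares PUs with both NUMA nodes without containing, or being contained in, either of them, which would be consistent with the message reported above.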

3) SRAT dmesg output was mentioned in another, similar report
http://www.open-mpi.org/community/lists/hwloc-users/2012/05/0639.php
so I am including ours here as well ...

[roberpj_at_bro127:~] dmesg | grep SRAT
ACPI: SRAT 00000000dfdba570 001D0 (v02 AMD AGESA 00000001 AMD 00000001)
SRAT: PXM 0 -> APIC 32 -> Node 0
SRAT: PXM 0 -> APIC 33 -> Node 0
SRAT: PXM 0 -> APIC 34 -> Node 0
SRAT: PXM 0 -> APIC 35 -> Node 0
SRAT: PXM 1 -> APIC 36 -> Node 1
SRAT: PXM 1 -> APIC 37 -> Node 1
SRAT: PXM 1 -> APIC 38 -> Node 1
SRAT: PXM 1 -> APIC 39 -> Node 1
SRAT: PXM 2 -> APIC 64 -> Node 2
SRAT: PXM 2 -> APIC 65 -> Node 2
SRAT: PXM 2 -> APIC 66 -> Node 2
SRAT: PXM 2 -> APIC 67 -> Node 2
SRAT: PXM 3 -> APIC 68 -> Node 3
SRAT: PXM 3 -> APIC 69 -> Node 3
SRAT: PXM 3 -> APIC 70 -> Node 3
SRAT: PXM 3 -> APIC 71 -> Node 3
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 100000-e0000000
SRAT: Node 0 PXM 0 100000000-820000000
SRAT: Node 1 PXM 1 820000000-1020000000
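
Note on the SRAT lines above: they can be tabulated per node to compare CPU and memory affinity. The following is a minimal sketch, assuming the two line formats seen in this dmesg output and assuming the memory ranges are byte addresses in hexadecimal; the regular expressions and variable names are illustrative only and not part of any tool.

```python
import re
from collections import defaultdict

# SRAT lines copied verbatim from the dmesg output above.
SRAT = """\
SRAT: PXM 0 -> APIC 32 -> Node 0
SRAT: PXM 0 -> APIC 33 -> Node 0
SRAT: PXM 0 -> APIC 34 -> Node 0
SRAT: PXM 0 -> APIC 35 -> Node 0
SRAT: PXM 1 -> APIC 36 -> Node 1
SRAT: PXM 1 -> APIC 37 -> Node 1
SRAT: PXM 1 -> APIC 38 -> Node 1
SRAT: PXM 1 -> APIC 39 -> Node 1
SRAT: PXM 2 -> APIC 64 -> Node 2
SRAT: PXM 2 -> APIC 65 -> Node 2
SRAT: PXM 2 -> APIC 66 -> Node 2
SRAT: PXM 2 -> APIC 67 -> Node 2
SRAT: PXM 3 -> APIC 68 -> Node 3
SRAT: PXM 3 -> APIC 69 -> Node 3
SRAT: PXM 3 -> APIC 70 -> Node 3
SRAT: PXM 3 -> APIC 71 -> Node 3
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 100000-e0000000
SRAT: Node 0 PXM 0 100000000-820000000
SRAT: Node 1 PXM 1 820000000-1020000000
"""

cpu_re = re.compile(r"SRAT: PXM (\d+) -> APIC (\d+) -> Node (\d+)")
mem_re = re.compile(r"SRAT: Node (\d+) PXM (\d+) ([0-9a-f]+)-([0-9a-f]+)")

cpus = defaultdict(list)  # node -> list of APIC ids
mem = defaultdict(int)    # node -> total size of its memory ranges (assumed bytes)

for line in SRAT.splitlines():
    m = cpu_re.match(line)
    if m:
        cpus[int(m.group(3))].append(int(m.group(2)))
        continue
    m = mem_re.match(line)
    if m:
        start, end = int(m.group(3), 16), int(m.group(4), 16)
        mem[int(m.group(1))] += end - start

for node in sorted(cpus):
    print(f"node {node}: {len(cpus[node])} CPUs, {mem[node] / 2**30:.1f} GiB")

# Nodes that have CPUs assigned but no memory range listed.
memoryless = [n for n in sorted(cpus) if mem[n] == 0]
print("nodes with CPUs but no memory range:", memoryless)
```

With the lines above, nodes 2 and 3 have CPUs but no memory range, while the hwloc output in 2) shows only two NUMA nodes. That mismatch may be relevant to the error, though this sketch does not establish the cause.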

4) Note the nodes have a 10GE interface on eth2 ...

[root_at_bro127:~] nano /var/log/messages (snip)
Jan 15 16:03:55 bro127 kernel: ADDRCONF(NETDEV_UP): eth2: link is not ready
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: changing MTU from 1500 to 8000
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: detected SFP+: 3
Jan 15 16:03:55 bro127 kernel: SoftIWARP attached
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: detected SFP+: 3
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 15 16:03:55 bro127 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready

[roberpj_at_bro127:~] modinfo ixgbe
filename: /lib/modules/2.6.32-279.5.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko
version: 3.6.7-k
license: GPL
description: Intel(R) 10 Gigabit PCI Express Network Driver
author: Intel Corporation, <linux.nics_at_[hidden]>
srcversion: EC64C3345C7AC6AB4BD6F5C
alias: pci: v00008086d0000154Asv*sd*bc*sc*i*
alias: pci: v00008086d00001557sv*sd*bc*sc*i*
alias: pci: v00008086d0000154Fsv*sd*bc*sc*i*
alias: pci: v00008086d0000154Dsv*sd*bc*sc*i*
alias: pci: v00008086d00001528sv*sd*bc*sc*i*
alias: pci: v00008086d000010F8sv*sd*bc*sc*i*
alias: pci: v00008086d0000151Csv*sd*bc*sc*i*
alias: pci: v00008086d00001529sv*sd*bc*sc*i*
alias: pci: v00008086d0000152Asv*sd*bc*sc*i*
alias: pci: v00008086d000010F9sv*sd*bc*sc*i*
alias: pci: v00008086d00001514sv*sd*bc*sc*i*
alias: pci: v00008086d00001507sv*sd*bc*sc*i*
alias: pci: v00008086d000010FBsv*sd*bc*sc*i*
alias: pci: v00008086d00001517sv*sd*bc*sc*i*
alias: pci: v00008086d000010FCsv*sd*bc*sc*i*
alias: pci: v00008086d000010F7sv*sd*bc*sc*i*
alias: pci: v00008086d00001508sv*sd*bc*sc*i*
alias: pci: v00008086d000010DBsv*sd*bc*sc*i*
alias: pci: v00008086d000010F4sv*sd*bc*sc*i*
alias: pci: v00008086d000010E1sv*sd*bc*sc*i*
alias: pci: v00008086d000010F1sv*sd*bc*sc*i*
alias: pci: v00008086d000010ECsv*sd*bc*sc*i*
alias: pci: v00008086d000010DDsv*sd*bc*sc*i*
alias: pci: v00008086d0000150Bsv*sd*bc*sc*i*
alias: pci: v00008086d000010C8sv*sd*bc*sc*i*
alias: pci: v00008086d000010C7sv*sd*bc*sc*i*
alias: pci: v00008086d000010C6sv*sd*bc*sc*i*
alias: pci: v00008086d000010B6sv*sd*bc*sc*i*
depends: mdio,dca
vermagic: 2.6.32-279.5.2.el6.x86_64 SMP mod_unload modversions
parm: IntMode:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2 (array of int)
parm: FdirMode:Flow Director filtering modes (0=Off, 1=Hashing) default 1 (array of int)
parm: max_vfs:Maximum number of virtual functions to allocate per physical function (uint)
parm: allow_unsupported_sfp:Allow unsupported and untested SFP+ modules on 82599-based adapters (uint)