
Hardware Locality Development Mailing List Archives


Subject: Re: [hwloc-devel] hwloc-1.4 assertion failures on Linux/POWER7
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-02-01 16:59:41


The topology of the virtual node is a bit unusual. I am reproducing a
similar setup with Linux cgroups and have already found some problems there.
I have no idea whether they are related to yours; we'll see once I have
some patches.

Brice
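
As an illustration of the grouping problem discussed in the quoted messages (32 logical processors that are really 8 cores with 4 threads each), here is a minimal sketch in Python. It assumes the kernel exposes a per-CPU thread-sibling list in the Linux cpulist format (for example "0-3"), as found under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list on kernels that report topology. The function names and the synthetic input are hypothetical and are not taken from hwloc.

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8" into a sorted tuple of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))


def group_into_cores(siblings_by_cpu):
    """Group logical CPUs that report the same thread-sibling list.

    siblings_by_cpu maps a logical CPU id to the text of its sibling list
    (hypothetical input, e.g. {0: "0-3", 1: "0-3"}).
    Returns a sorted list of tuples, one tuple of CPU ids per inferred core.
    """
    cores = set()
    for cpu, text in siblings_by_cpu.items():
        siblings = parse_cpulist(text)
        # A CPU without sibling information is treated as a core of its own.
        cores.add(siblings if siblings else (cpu,))
    return sorted(cores)


# Synthetic data (an assumption, not output from the machine in this thread):
# 32 logical CPUs, numbered so that each run of 4 consecutive ids shares a core.
sample = {cpu: f"{4 * (cpu // 4)}-{4 * (cpu // 4) + 3}" for cpu in range(32)}
cores = group_into_cores(sample)
print(len(cores), "cores,", len(cores[0]), "threads each")
```

When the kernel provides no sibling information at all, every logical CPU falls back to its own group, which is one way a tool can end up with a flat list of 32 processors and no structure.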

On 01/02/2012 21:07, Paul H. Hargrove wrote:
> Responses interspersed w/ your questions, below.
> -Paul
>
> On 2/1/2012 6:13 AM, Brice Goglin wrote:
>> Can you run hwloc-gather-topology and send the resulting tarball and
>> output ?
>
> Attached.
>
>> We've seen some powerpc machines where the old kernel didn't say much
>> about the topology, so your 8 cores with 4 threads each appeared as 32
>> logical processors without much detail about their organization. I
>> assume you can't upgrade the kernel. Which kernel is this?
>
> I am told the VM spans 1 socket of 8 cores, and each core has 4 threads.
> /proc/cpuinfo doesn't show any "structure".
> So, when lstopo reports the machine as (8 sockets X 1 core X 4
> threads), it was probably as close as it could be w/o the "missing"
> information. [note that I MISreported lstopo's output as (8 sockets X
> 4 cores) in my previous email].
>
> I am a guest on this machine and can't change the kernel nor add
> accounts.
>> $ uname -a
>> Linux biou2.rice.edu 2.6.32-131.6.1.el6.ppc64 #1 SMP Tue Sep 13
>> 15:16:45 CDT 2011 ppc64 ppc64 ppc64 GNU/Linux
> Which isn't really all that old.
>
>
>> Yes, the virtual node could also make things worse. What kind
>> of "virtualization" is this?
>
>
> I don't know for certain, but would guess they are using the stuff
> described in Chapter 3 of the pdf I gave the URL for.
> I don't think RHEL6 has any other virtualization support for PPC.
>
>> Thanks
>> Brice
>>
>>
>> On 01/02/2012 04:29, Paul H. Hargrove wrote:
>>> This node is an IBM "Power 750 Express server", described in detail at
>>> http://www.redbooks.ibm.com/redpapers/pdfs/redp4638.pdf
>>>
>>> Notably it is a quad-socket chassis which can take 6-core or 8-core
>>> processors.
>>> However, lstopo is reporting 8 sockets of 4-cores each.
>>> This discrepancy led me to recall the following from an email sent to
>>> me by a colleague:
>>>> A surprise
>>>> to me is that the login nodes provide the appearance of having 32
>>>> CPUs, but those are in fact only 8 cores with 4 hyper-threads,
>>>> and they are in fact VMs on top of one socket of a compute node.
>>> So, I am not really certain what I should expect lstopo to report.
>>> I suppose it is accurately reporting to me the virtual node's
>>> configuration.
>>>
>>> I bring this up because it may very well be related to the assertion
>>> failures.
>>> My guess here being that some part of hwloc has seen past the
>>> "virtual" to see the "physical" and the assertion failure reflects the
>>> resulting inconsistency. But that is just a guess. Let me know how I
>>> might help debug this failure.
>>>
>>> -Paul
>>>
>>> On 1/31/2012 7:12 PM, Paul H. Hargrove wrote:
>>>> The problem I reported below also exists in hwloc-1.4.1.
>>>> Additionally, I can reproduce the SEGVs with xlc which Chris Samuel
>>>> reported in
>>>>
>>>> http://www.open-mpi.org/community/lists/hwloc-devel/2012/01/2738.php
>>>>
>>>> -Paul
>>>>
>>>> On 1/31/2012 5:56 PM, Paul H. Hargrove wrote:
>>>>> When running "make check" in hwloc-1.3.1 on a Linux/POWER7 system I
>>>>> see:
>>>>>> lt-linux-libnuma:
>>>>>> /users/phh1/OMPI/hwloc-1.3.1-linux-ppc64-gcc//hwloc-1.3.1/tests/linux-libnuma.c:53:
>>>>>> main: Assertion `hwloc_bitmap_isequal(set, set2)' failed.
>>>>>> /bin/sh: line 5: 21415 Aborted ${dir}$tst
>>>>>> FAIL: linux-libnuma
>>>>> I've reproduced that failure with 4 different compilers (3 gcc's and
>>>>> an xlc).
>>>>> The xlc-built hwloc-1.3.1 also fails an additional test:
>>>>>> lt-glibc-sched:
>>>>>> /users/phh1/OMPI/hwloc-1.3.1-linux-ppc64-xlc-11.1//hwloc-1.3.1/tests/glibc-sched.c:43:
>>>>>> main: Assertion `!err' failed.
>>>>>> /bin/sh: line 5: 7077 Aborted ${dir}$tst
>>>>>> FAIL: glibc-sched
>>>>>
>>>>> The contents of /proc/cpuinfo are:
>>>>>> processor : 0
>>>>>> cpu : POWER7 (architected), altivec supported
>>>>>> clock : 3550.000000MHz
>>>>>> revision : 2.0 (pvr 003f 0200)
>>>>>>
>>>>>> [30 more of the same]
>>>>>>
>>>>>> processor : 31
>>>>>> cpu : POWER7 (architected), altivec supported
>>>>>> clock : 3550.000000MHz
>>>>>> revision : 2.0 (pvr 003f 0200)
>>>>>>
>>>>>> timebase : 512000000
>>>>>> platform : pSeries
>>>>>> model : IBM,8233-E8B
>>>>>> machine : CHRP IBM,8233-E8B
>>>>> Let me know of any other h/w or s/w info I can report.
>>>>>
>>>>> -Paul
>>>>>
>
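
To make concrete the point that this /proc/cpuinfo shows no "structure", here is a minimal sketch in Python that parses text in the format quoted above and checks for topology-related fields. The sample text is abbreviated from the quoted output (processors 0 and 31 only). The field names in TOPOLOGY_KEYS are the ones x86 kernels print; treating their absence as "no structure" is an assumption of this sketch, and the function names are hypothetical.

```python
# Fields that x86 kernels print per processor to describe sockets and cores.
TOPOLOGY_KEYS = {"physical id", "core id", "siblings", "cpu cores"}

# Abbreviated from the /proc/cpuinfo output quoted in this thread.
SAMPLE = """\
processor : 0
cpu : POWER7 (architected), altivec supported
clock : 3550.000000MHz
revision : 2.0 (pvr 003f 0200)

processor : 31
cpu : POWER7 (architected), altivec supported
clock : 3550.000000MHz
revision : 2.0 (pvr 003f 0200)

timebase : 512000000
platform : pSeries
model : IBM,8233-E8B
machine : CHRP IBM,8233-E8B
"""


def parse_cpuinfo(text):
    """Split cpuinfo-style text into per-processor records and machine-wide fields."""
    processors, machine = [], {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current processor record
            continue
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor":
            current = {}
            processors.append(current)
        # Fields outside a processor record describe the whole machine.
        target = current if current is not None else machine
        target[key] = value
    return processors, machine


def has_topology_fields(processors):
    """Return True if any processor record carries a socket or core field."""
    return any(TOPOLOGY_KEYS & record.keys() for record in processors)


processors, machine = parse_cpuinfo(SAMPLE)
print(len(processors), "records,", "topology fields:", has_topology_fields(processors))
```

With input like this, a tool has only a flat list of processor numbers to work with, so any socket or core grouping has to come from another source.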