Open MPI Development Mailing List Archives

Subject: [OMPI devel] Trunk broken on NERSC's Cray XE6
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2013-01-25 21:52:00


Following up as I promised...

My results on NERSC's small Cray XE6 (the test/dev rack "Grace", rather
than the full-sized "Hopper") match those I get on the Cray XC30 (Edison),
and differ from those Ralph reports for LANL's XE6.

An attempt to build/link hello_c.c results in unresolved symbols from
libnuma, libxpmem and libugni.
A complete list is available if it matters.
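
For anyone comparing failures across machines, the unresolved symbols can be
collected from saved linker output with a short script. The following is only
a minimal sketch in Python and is not part of Open MPI: the function name
`undefined_symbols` and the sample text are illustrative assumptions, and the
pattern assumes GNU ld's "undefined reference to" wording.

```python
import re

# Matches GNU ld diagnostics of the form: undefined reference to `mbind'
UNDEF_RE = re.compile(r"undefined reference to [`'](?P<sym>[^`']+)'")


def undefined_symbols(linker_output):
    """Return the unique undefined symbols named in linker output, sorted."""
    return sorted({m.group("sym") for m in UNDEF_RE.finditer(linker_output)})


if __name__ == "__main__":
    # Hypothetical, shortened sample; real output carries full paths.
    sample = (
        "topology-linux.c:1116: undefined reference to `mbind'\n"
        "topology-linux.c:1337: undefined reference to `get_mempolicy'\n"
        "topology-linux.c:1239: undefined reference to `get_mempolicy'\n"
    )
    print(undefined_symbols(sample))  # ['get_mempolicy', 'mbind']
```

Because the pattern is searched across the whole text, it should also work on
output where the file/line prefix and the message ended up on separate lines.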

This is still with last night's openmpi-1.9a1r27905 tarball, and the
following 1-line mod to the platform file:
- enable_shared=yes
+ enable_shared=no
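
With a static build, every dependency has to appear on the final link line,
so one way to compare systems is to check the wrapper's link flags for the
libraries in question. The sketch below is an assumption-laden illustration in
Python, not anything shipped with Open MPI: `missing_libs` is a hypothetical
helper, the sample link line is made up, and the input is assumed to be text
such as that printed by `mpicc --showme:link`.

```python
import shlex


def missing_libs(link_line, required=("numa", "xpmem", "ugni")):
    """Return the required library names that have no -l<name> flag.

    Only the joined form (-lnuma) is recognized; a separated "-l numa"
    is not handled in this sketch.
    """
    present = {tok[2:] for tok in shlex.split(link_line) if tok.startswith("-l")}
    return [lib for lib in required if lib not in present]


if __name__ == "__main__":
    # Hypothetical link line for illustration only.
    line = "-L/opt/lib -lmpi -lopen-pal -lxpmem -lugni -ldl"
    print(missing_libs(line))  # ['numa']
```

An empty result would suggest the flags are present and the problem lies
elsewhere, for example in link order or in which directories are searched.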

If it will help determine what is going on, I can probably get NERSC
accounts for any of the DOE Lab folks easily.
They will only get access to the full-sized XE6 (Hopper) for now.

In case any of these are helpful clues to the difference(s):
$ module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) torque/4.1.4-snap.201211160904
  3) moab/6.0.4
  4) xtpe-network-gemini
  5) cray-mpich2/5.6.0
  6) atp/1.6.0
  7) xe-sysroot/4.1.40
  8) switch/1.0-1.0401.36779.2.72.gem
  9) shared-root/1.0-1.0401.37253.3.50.gem
 10) pdsh/2.26-1.0401.37449.1.1.gem
 11) nodehealth/5.0-1.0401.38460.12.18.gem
 12) lbcd/2.1-1.0401.35360.1.2.gem
 13) hosts/1.0-1.0401.35364.1.115.gem
 14) configuration/1.0-1.0401.35391.1.2.gem
 15) ccm/2.2.0-1.0401.37254.2.142
 16) audit/1.0.0-1.0401.37969.2.32.gem
 17) rca/1.0.0-2.0401.38656.2.2.gem
 18) dvs/1.8.6_0.9.0-1.0401.1401.1.120
 19) csa/3.0.0-1_2.0401.37452.4.50.gem
 20) job/1.5.5-0.1_2.0401.35380.1.10.gem
 21) xpmem/0.1-2.0401.36790.4.3.gem
 22) gni-headers/2.1-1.0401.5675.4.4.gem
 23) dmapp/3.2.1-1.0401.5983.4.5.gem
 24) pmi/4.0.0-1.0000.9282.69.4.gem
 25) ugni/4.0-1.0401.5928.9.5.gem
 26) udreg/2.3.2-1.0401.5929.3.3.gem
 27) xt-libsci/12.0.00
 28) gcc/4.7.2
 29) xt-asyncpe/5.16
 30) eswrap/1.0.10
 31) xtpe-mc12
 32) cray-shmem/5.6.0
 33) PrgEnv-gnu/4.1.40

-Paul

On Fri, Jan 25, 2013 at 5:50 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> Ralph,
>
> Again our results differ.
> I did NOT need the additional #include to link a simple test program.
> I am going to try on our XE6 shortly.
>
> I suspect you are right about something in the configury being different.
> I am willing to try a few more nightly tarballs if somebody thinks they
> have the proper fix.
>
> -Paul
>
>
> On Fri, Jan 25, 2013 at 5:45 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> On Jan 25, 2013, at 5:12 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>> Ralph,
>>
>> Those are the result of the missing -lnuma that Nathan already identified
>> earlier as missing in BOTH 1.7 and trunk.
>> I see MORE missing symbols, which include ones from libxpmem and libugni.
>>
>>
>> Alright, let me try to be clearer. We are missing -lnuma as well as the
>> required include file - both are necessary to remove the issue.
>>
>> I find both the xpmem and ugni libraries *are* correctly included in both
>> 1.7 and trunk. It could be a case of finding them in the configury, but we
>> are finding them *and* correctly including them on the XE6.
>>
>> HTH
>> Ralph
>>
>>
>> -Paul
>>
>>
>> On Fri, Jan 25, 2013 at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>>
>>> On Jan 25, 2013, at 4:53 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> > The repeated libs is something we obviously should fix, but all the
>>> > libs are there - including lustre. I guess those were dropped due to the
>>> > shared lib setting, so we probably should fix that in the platform file.
>>> >
>>> > Perhaps that is the cause of Nathan's issue? shrug...regardless, apps
>>> > build and run just fine using mpicc for me.
>>>
>>> Correction - turns out I misspoke. I find apps *don't* build correctly
>>> with this setup:
>>>
>>> mpicc -g hello_c.c -o hello_c
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o): In function `hwloc_linux_set_area_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1116: undefined reference to `mbind'
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1135: undefined reference to `mbind'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o): In function `hwloc_linux_get_area_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1337: undefined reference to `get_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o): In function `hwloc_linux_find_kernel_max_numnodes':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1239: undefined reference to `get_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o): In function `hwloc_linux_set_thisthread_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1183: undefined reference to `set_mempolicy'
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1194: undefined reference to `migrate_pages'
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1206: undefined reference to `set_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o): In function `hwloc_linux_get_thisthread_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1284: undefined reference to `get_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o): In function `hwloc_linux_find_kernel_max_numnodes':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1239: undefined reference to `get_mempolicy'
>>> collect2: ld returned 1 exit status
>>> make: *** [hello_c] Error 1
>>>
>>> So it looks like hwloc is borked when built static.
>>>
>>> Sigh
>>> Ralph
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900