Open MPI Development Mailing List Archives

Subject: [OMPI devel] Trunk broken on NERSC's Cray XE6
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2013-01-25 21:52:00


Following up as I promised...

My results on NERSC's small Cray XE6 (the test/dev rack "Grace", rather
than the full-sized "Hopper") match those I get on the Cray XC30 (Edison),
and do not match what Ralph reports for LANL's XE6.

An attempt to build/link hello_c.c results in unresolved symbols from
libnuma, libxpmem and libugni.
A complete list is available if it matters.
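
For anyone who wants to build such a list from their own link output, here is a minimal sketch. It assumes GNU ld's usual "undefined reference to `symbol'" wording, as seen in the errors quoted further down; the function name and the sample text are hypothetical, for illustration only.

```python
import re
from collections import defaultdict

# Assumes GNU ld phrasing, e.g.: undefined reference to `mbind'
UNDEFINED_RE = re.compile(r"undefined reference to `([^']+)'")


def undefined_symbols(linker_output):
    """Return {symbol: count} for every undefined reference in the text."""
    counts = defaultdict(int)
    for line in linker_output.splitlines():
        match = UNDEFINED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


# Hypothetical, shortened sample in the style of the errors in this thread.
sample = """\
topology-linux.c:1116: undefined reference to `mbind'
topology-linux.c:1135: undefined reference to `mbind'
topology-linux.c:1337: undefined reference to `get_mempolicy'
collect2: ld returned 1 exit status
"""

if __name__ == "__main__":
    # Print one line per unique symbol, with how often it was reported.
    for symbol, count in sorted(undefined_symbols(sample).items()):
        print(f"{count:3d}  {symbol}")
```

Feeding it the captured stderr of the failing mpicc command gives one line per unique symbol, which makes it easier to see which library (libnuma, libxpmem or libugni) each missing symbol likely belongs to.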

This is still with last night's openmpi-1.9a1r27905 tarball, and the
following 1-line mod to the platform file:
- enable_shared=yes
+ enable_shared=no

If it will help determine what is going on, I can probably get NERSC
accounts for any of the DOE Lab folks easily.
They will only get access to the full-sized XE6 (Hopper) for now.

In case any of these are helpful clues to the difference(s):
$ module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) torque/4.1.4-snap.201211160904
  3) moab/6.0.4
  4) xtpe-network-gemini
  5) cray-mpich2/5.6.0
  6) atp/1.6.0
  7) xe-sysroot/4.1.40
  8) switch/1.0-1.0401.36779.2.72.gem
  9) shared-root/1.0-1.0401.37253.3.50.gem
 10) pdsh/2.26-1.0401.37449.1.1.gem
 11) nodehealth/5.0-1.0401.38460.12.18.gem
 12) lbcd/2.1-1.0401.35360.1.2.gem
 13) hosts/1.0-1.0401.35364.1.115.gem
 14) configuration/1.0-1.0401.35391.1.2.gem
 15) ccm/2.2.0-1.0401.37254.2.142
 16) audit/1.0.0-1.0401.37969.2.32.gem
 17) rca/1.0.0-2.0401.38656.2.2.gem
 18) dvs/1.8.6_0.9.0-1.0401.1401.1.120
 19) csa/3.0.0-1_2.0401.37452.4.50.gem
 20) job/1.5.5-0.1_2.0401.35380.1.10.gem
 21) xpmem/0.1-2.0401.36790.4.3.gem
 22) gni-headers/2.1-1.0401.5675.4.4.gem
 23) dmapp/3.2.1-1.0401.5983.4.5.gem
 24) pmi/4.0.0-1.0000.9282.69.4.gem
 25) ugni/4.0-1.0401.5928.9.5.gem
 26) udreg/2.3.2-1.0401.5929.3.3.gem
 27) xt-libsci/12.0.00
 28) gcc/4.7.2
 29) xt-asyncpe/5.16
 30) eswrap/1.0.10
 31) xtpe-mc12
 32) cray-shmem/5.6.0
 33) PrgEnv-gnu/4.1.40

-Paul

On Fri, Jan 25, 2013 at 5:50 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> Ralph,
>
> Again our results differ.
> I did NOT need the additional #include to link a simple test program.
> I am going to try on our XE6 shortly.
>
> I suspect you are right about something in the configury being different.
> I am willing to try a few more nightly tarballs if somebody thinks they
> have the proper fix.
>
> -Paul
>
>
> On Fri, Jan 25, 2013 at 5:45 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> On Jan 25, 2013, at 5:12 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>> Ralph,
>>
>> Those are the result of the missing -lnuma that Nathan already identified
>> earlier as missing in BOTH 1.7 and trunk.
>> I see MORE missing symbols, which include ones from libxpmem and libugni.
>>
>>
>> Alright, let me try to be clearer. We are missing -lnuma as well as the
>> required include file - both are necessary to remove the issue.
>>
>> I find both the xpmem and ugni libraries *are* correctly included in both
>> 1.7 and trunk. It could be a case of finding them in the configury, but we
>> are finding them *and* correctly including them on the XE6.
>>
>> HTH
>> Ralph
>>
>>
>> -Paul
>>
>>
>> On Fri, Jan 25, 2013 at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>>
>>> On Jan 25, 2013, at 4:53 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> > The repeated libs is something we obviously should fix, but all the
>>> > libs are there - including lustre. I guess those were dropped due to the
>>> > shared lib setting, so we probably should fix that in the platform file.
>>> >
>>> > Perhaps that is the cause of Nathan's issue? shrug...regardless, apps
>>> > build and run just fine using mpicc for me.
>>>
>>> Correction - turns out I misspoke. I find apps *don't* build correctly
>>> with this setup:
>>>
>>> mpicc -g hello_c.c -o hello_c
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o):
>>> In function `hwloc_linux_set_area_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1116:
>>> undefined reference to `mbind'
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1135:
>>> undefined reference to `mbind'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o):
>>> In function `hwloc_linux_get_area_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1337:
>>> undefined reference to `get_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o):
>>> In function `hwloc_linux_find_kernel_max_numnodes':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1239:
>>> undefined reference to `get_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o):
>>> In function `hwloc_linux_set_thisthread_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1183:
>>> undefined reference to `set_mempolicy'
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1194:
>>> undefined reference to `migrate_pages'
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1206:
>>> undefined reference to `set_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o):
>>> In function `hwloc_linux_get_thisthread_membind':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1284:
>>> undefined reference to `get_mempolicy'
>>> /usr/aprojects/hpctools/rhc/build/lib/libopen-pal.a(topology-linux.o):
>>> In function `hwloc_linux_find_kernel_max_numnodes':
>>> /lscratch1/rcastain/openmpi-1.9a1/opal/mca/hwloc/hwloc151/hwloc/src/topology-linux.c:1239:
>>> undefined reference to `get_mempolicy'
>>> collect2: ld returned 1 exit status
>>> make: *** [hello_c] Error 1
>>>
>>> So it looks like hwloc is borked when built static.
>>>
>>> Sigh
>>> Ralph
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900