Open MPI Development Mailing List Archives

From: Nathan DeBardeleben (ndebard_at_[hidden])
Date: 2005-09-15 15:47:07


Jeff and everyone else I contacted about this: thanks for helping track
down the problem. I've been beating my head on this for a few days and
don't have the library experience to have caught these nuances. Thanks
again!

-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard_at_[hidden]
---------------------------------------------------------------------

Jeff Squyres wrote:

>Followup for the list... a bit of explanation of Nathan's problem about
>shared libraries and unresolved symbols.
>
>Short version:
>--------------
>
>It's an OMPI bug when built as a shared library (not an issue for
>static libraries). The fix is straightforward, but involves grunt
>work. I'll try to get a student to do it RSN.
>
>Long version:
>-------------
>
>What's happening is that we are not linking OMPI components against the
>opal/orte/ompi libraries. As such, we are exploiting the fact that
>when they are dlopened by a standalone application (e.g., a.out), the
>Libtool portable version of dlopen() exports all the symbols from the
>parent process such that the child can find and use them at run-time to
>resolve any unknown symbols. Here's an example (I'm leaving out some
>fine-grained details, and it's slightly different on different OS's,
>but this is "true enough" for the purposes of this thread):
>
>- a.out, which was linked against libopal.so (and friends), launches
>- the linker runs into an unresolved symbol
>- the linker sees that the symbol is supposed to be in "libopal.so",
>and starts searching LD_LIBRARY_PATH for it
>- the linker finds libopal.so, loads it, and is able to resolve the
>symbol
>
>It gets interesting at this part:
>
>- within MPI_Init()/orte_init()/opal_init() (i.e., however you
>initialized yourself to OMPI/ORTE/OPAL), we use the Libtool portable
>dlopen() to open our components
>- the components will have unresolved symbols as well (i.e., the
>symbols in libopal, liborte, and libmpi)
>- when the linker hits these, it tries to resolve them.
>- first, the linker looks in the public namespace of the process, and
>if it finds the symbols there, it's done
>- in this case, libopal (and friends) have already been loaded in the
>process, so the linker can find the symbols right away -- without
>loading any additional libraries
>
>This is the scheme that we were relying on for libopal/orte/ompi
>symbols to be resolved in our components. And for standalone
>executables, it works fine.
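
A minimal sketch of the public-versus-private distinction described above, assuming plain POSIX dlopen() rather than the Libtool wrapper that OMPI actually uses. The names open_component and load_or_report are hypothetical and not part of any OMPI API. RTLD_GLOBAL corresponds to putting a loaded object's symbols into the public namespace, and RTLD_LOCAL keeps them private to the handle. On older glibc the program may need to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Open a shared object with plain POSIX dlopen().
 * make_public != 0: RTLD_GLOBAL, the object's symbols join the process-wide
 *                   namespace, so objects loaded later can resolve against them.
 * make_public == 0: RTLD_LOCAL, the symbols stay private to the returned handle.
 * RTLD_NOW forces every undefined symbol to be resolved at load time.
 * Returns NULL on failure; dlerror() then describes the reason. */
void *open_component(const char *path, int make_public)
{
    assert(path != NULL);
    int flags = RTLD_NOW | (make_public ? RTLD_GLOBAL : RTLD_LOCAL);
    return dlopen(path, flags);
}

/* Hypothetical helper: try to load a component and print the loader's
 * diagnostic on failure. Returns 0 on success, -1 on failure. */
int load_or_report(const char *path, int make_public)
{
    dlerror(); /* clear any stale error message */
    void *handle = open_component(path, make_public);
    if (handle == NULL) {
        const char *why = dlerror();
        fprintf(stderr, "cannot load %s: %s\n", path, why ? why : "unknown error");
        return -1;
    }
    return 0;
}
```

Calling load_or_report() on a component whose dependencies are visible in neither the public namespace nor its own dependency list would be expected to fail with an "undefined symbol" style message, much like the mca_paffinity_linux.so error quoted later in this thread.
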
>
>But for an environment like Eclipse, it doesn't.
>
>I don't know anything about Eclipse, but I'm assuming that it does
>something similar to our component system -- it dlopens its plugins.
>However -- here's where my guess comes in -- it doesn't put the symbols
>of an opened plugin into the public namespace of the process (this is
>different from what OMPI does, for various reasons). Hence, if you
>build an Eclipse component against OMPI, the Eclipse component will be
>dynamically linked against libopal (etc.). So when Eclipse loads in
>your component, similar to the standalone executable example above, the
>linker will realize that it has unresolved symbols and will use the
>normal mechanism to resolve them (e.g., look for libopal.so in
>LD_LIBRARY_PATH).
>
>The problem comes in when we dlopen OMPI/ORTE/OPAL components.
>
>Our scheme assumed that we'd be able to find the opal/orte/ompi symbols
>in the public namespace of the parent process. But they're not --
>Eclipse loaded the component in a private namespace, and therefore all
>the opal/orte/ompi symbols are in that private namespace. And
>therefore the OMPI/ORTE/OPAL components can't find the symbols, and the
>linker barfs.
>
>The solution is to change our scheme in OMPI a bit. We just need to
>add a few lines to all the component Makefile.am's to, in the dynamic
>case, link the components against their relevant libraries (opal
>components linked against libopal, orte components linked against
>liborte and libopal, etc.). This does not make the components
>significantly larger -- it just adds an entry into the dynamic linker
>section of the component's resulting .so file indicating "if you have
>unresolved symbols, go look in libopal.so" (etc.).
>
>This allows the components themselves to pull in shared libraries when
>they are dlopened -- if they need to. If the symbols can be resolved
>in the parent process' public symbol namespace, they still will be (as
>in the standalone executable example, above). But if they can't be
>resolved that way, this gives the ability to explicitly find and pull
>in a shared library and resolve the symbols that way (as in the Eclipse
>plugin example, above).
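
The lookup order described above can be sketched as a toy model. This is not how a real dynamic linker is implemented; it only mirrors the two steps in the explanation: search the public namespace first, then fall back to the libraries the component itself was linked against. All names here (toy_lib, toy_resolve) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: a "library" is a name plus the symbols it defines. */
struct toy_lib {
    const char *name;
    const char *const *symbols; /* NULL-terminated list of symbol names */
};

/* Return 1 if the library defines the symbol, 0 otherwise. */
static int lib_defines(const struct toy_lib *lib, const char *symbol)
{
    for (size_t i = 0; lib->symbols[i] != NULL; i++) {
        if (strcmp(lib->symbols[i], symbol) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Return the name of the library that satisfies the symbol, or NULL.
 * Step 1: libraries already visible in the public namespace of the process.
 * Step 2: libraries recorded as dependencies of the component itself
 *         (the entries that linking the component against libopal adds). */
const char *toy_resolve(const char *symbol,
                        const struct toy_lib *const *public_libs, size_t n_public,
                        const struct toy_lib *const *needed_libs, size_t n_needed)
{
    assert(symbol != NULL);
    for (size_t i = 0; i < n_public; i++) {
        if (lib_defines(public_libs[i], symbol)) {
            return public_libs[i]->name;
        }
    }
    for (size_t i = 0; i < n_needed; i++) {
        if (lib_defines(needed_libs[i], symbol)) {
            return needed_libs[i]->name;
        }
    }
    return NULL; /* unresolved: the real linker would report an error here */
}
```

In this model the standalone case is a lookup that succeeds in step 1. The Eclipse case is an empty public list: without a dependency list the lookup returns NULL, and with libopal in the dependency list it succeeds in step 2.
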
>
>Aren't computers fun? :-)
>
>
>On Sep 14, 2005, at 12:47 PM, Nathan DeBardeleben wrote:
>
>
>
>>Let me explain what I'm doing real quickly.
>>
>>I have a piece of Java code which is calling OMPI calls. It's doing
>>this through JNI (java native interface). Don't worry, you don't have
>>to understand Java to try and help me here. The JNI code is C with
>>some funky macros in it provided by Java.
>>
>>I have to compile the JNI C code into a shared library and then the
>>Java code will load it dynamically when that class is instantiated.
>>
>>So - here's my compile line:
>>
>>
>>
>>>[sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I
>>>/usr/java/jdk1.5.0_04/include/linux -c ptp_ompi_jni.c -fPIC
>>>[sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I
>>>/usr/java/jdk1.5.0_04/include/linux -shared -o libptp_ompi_jni.so
>>>ptp_ompi_jni.o
>>>
>>>
>>I then have libptp_ompi_jni.so. I then load that from within Java.
>>If I setup my LD_LIBRARY_PATH and some args to the Java VM correctly,
>>then it finds the above library and loads it up. OK - all fine so
>>far.
>>
>>However, when I call 'orte_init()' it craps out with the following
>>error:
>>
>>
>>
>>>/usr/java/jdk1.5.0_04/bin/java: error while loading shared libraries:
>>>/home/ndebard/local/ompi/lib/openmpi/mca_paffinity_linux.so:
>>>undefined symbol: mca_base_param_reg_int
>>>
>>>
>>So I went digging in mca_paffinity_linux.so looking for that symbol.
>>
>>
>>
>>>[sparkplug]~/<3>openmpi > nm mca_paffinity_linux.so | grep mca_base_param_reg
>>> U mca_base_param_reg_int
>>>[sparkplug]~/<3>openmpi >
>>>
>>>
>>OK. So it's undefined in that .so.
>>I'm really not a library guy (can't you tell from my myriad of
>>mails?). What does this mean? I went back digging in the parent
>>directory, /home/ndebard/local/ompi/lib, to find the symbol.
>>
>>
>>
>>>[sparkplug]~/<2>lib > nm libopal.so | grep mca_base_param_reg_int
>>>000000000001ce00 T mca_base_param_reg_int
>>>000000000001cea3 T mca_base_param_reg_int_name
>>>[sparkplug]~/<2>lib >
>>>
>>>
>>OK so I read this as it's defined in libopal.so.
>>Do you have any idea why my JNI library is trying to load
>>mca_paffinity_linux.so?
>>I went back to my compile line and added -lopal -lmpi -lorte just in
>>case, but that didn't help.
>>
>>Again, Jeff, I know this isn't really your concern (unless you want a
>>wicked OMPI graphical demo at SC!) :) but I wanted to drop it out
>>there in case you had any insight. I'm kinda stumped on this one.
>>
>>Does it mean my ompi compile is bad?
>>
>>-- Nathan
>>Correspondence
>>---------------------------------------------------------------------
>>Nathan DeBardeleben, Ph.D.
>>Los Alamos National Laboratory
>>Parallel Tools Team
>>High Performance Computing Environments
>>phone: 505-667-3428
>>email: ndebard_at_[hidden]
>>---------------------------------------------------------------------
>>
>>
>>
>>Jeff Squyres wrote:
>>
>>
>>
>>>Maybe I'm dense -- I thought you couldn't use --shared when linking
>>>to a static library...?
>>>
>>>If you want to build OMPI as a shared library, then ditch the
>>>--enable-static --disable-shared from your configure line (building
>>>OMPI as shared is the default, which is how I build 95% of the time).
>>>
>>>
>>>
>>>On Sep 12, 2005, at 5:47 PM, Nathan DeBardeleben wrote:
>>>
>>>
>>>
>>>
>>>>I've been having this problem for a week or so and I've been asking
>>>>other people to weigh in if they know what I'm doing wrong. I've
>>>>gotten nowhere on this so I figure I'll finally drop it out on the
>>>>list.
>>>>First, here's the important info:
>>>>The machine:
>>>>
>>>>
>>>>
>>>>
>>>>>[sparkplug]~ > cat /etc/issue
>>>>>
>>>>>Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).
>>>>>
>>>>>
>>>>>[sparkplug]~ > uname -a
>>>>>Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64
>>>>>x86_64 x86_64 GNU/Linux
>>>>>
>>>>>
>>>>>
>>>>My versions of libtool, autoconf, automake:
>>>>
>>>>
>>>>
>>>>
>>>>>[sparkplug]~ > libtool --version
>>>>>ltmain.sh (GNU libtool) 1.5.20 (1.1220.2.287 2005/08/31 18:54:15)
>>>>>
>>>>>Copyright (C) 2005 Free Software Foundation, Inc.
>>>>>This is free software; see the source for copying conditions.
>>>>>There is NO
>>>>>warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>>>>PURPOSE.
>>>>>[sparkplug]~ > autoconf --version
>>>>>autoconf (GNU Autoconf) 2.59
>>>>>Written by David J. MacKenzie and Akim Demaille.
>>>>>
>>>>>Copyright (C) 2003 Free Software Foundation, Inc.
>>>>>This is free software; see the source for copying conditions.
>>>>>There is NO
>>>>>warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>>>>PURPOSE.
>>>>>[sparkplug]~ > automake --version
>>>>>automake (GNU automake) 1.8.5
>>>>>Written by Tom Tromey <tromey_at_[hidden]>.
>>>>>
>>>>>Copyright 2004 Free Software Foundation, Inc.
>>>>>This is free software; see the source for copying conditions.
>>>>>There is NO
>>>>>warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>>>>PURPOSE.
>>>>>[sparkplug]~ >
>>>>>
>>>>>
>>>>>
>>>>My ompi version: 7322 - but this has been going on for a few days
>>>>like I said and I've been updating a lot, with no progress.
>>>>
>>>>Configured using:
>>>>
>>>>
>>>>
>>>>
>>>>>$ ./configure --enable-static --disable-shared --without-threads
>>>>>--prefix=/home/ndebard/local/ompi --with-devel-headers
>>>>>--enable-mca-no-build=ptl-gm
>>>>>
>>>>>
>>>>>
>>>>Simple C file which I will compile into a shared library:
>>>>
>>>>
>>>>
>>>>
>>>>>int test_compile(int x) {
>>>>> int rc;
>>>>>
>>>>> rc = orte_init(true);
>>>>> printf("rc = %d\n", rc);
>>>>>
>>>>> return x + 1;
>>>>>}
>>>>>
>>>>>
>>>>>
>>>>Above file is named 'testlib.c'
>>>>
>>>>OK, so let's build this:
>>>>
>>>>
>>>>
>>>>
>>>>>[sparkplug]~/ompi-test > mpicc -c testlib.c
>>>>>[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
>>>>>/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
>>>>>testlib.o: relocation R_X86_64_32 can not be used when making a
>>>>>shared object; recompile with -fPIC
>>>>>testlib.o: could not read symbols: Bad value
>>>>>collect2: ld returned 1 exit status
>>>>>
>>>>>
>>>>>
>>>>OK so relocation problems. Maybe I'll follow the directions and
>>>>compile my file with -fPIC myself:
>>>>
>>>>
>>>>
>>>>
>>>>>[sparkplug]~/ompi-test > mpicc -c testlib.c -fPIC
>>>>>[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
>>>>>/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
>>>>>/home/ndebard/local/ompi/lib/liborte.a(orte_init.o): relocation
>>>>>R_X86_64_32 can not be used when making a shared object; recompile
>>>>>with -fPIC
>>>>>/home/ndebard/local/ompi/lib/liborte.a: could not read symbols: Bad
>>>>>value
>>>>>collect2: ld returned 1 exit status
>>>>>
>>>>>
>>>>>
>>>>OK so I read this as there's a relocation problem in 'liborte.a'. I
>>>>un-arred liborte.a and checked some of the files with 'file' and it
>>>>says 64bit. I haven't yet written a script to check every file in
>>>>here, but here's orte_init.o:
>>>>
>>>>
>>>>
>>>>
>>>>>[sparkplug]~/<1>tmp > file orte_init.o
>>>>>orte_init.o: ELF 64-bit LSB relocatable, AMD x86-64, version 1
>>>>>(SYSV), not stripped
>>>>>
>>>>>
>>>>>
>>>>So that at least says it's 64bit.
>>>>And to confirm, my mpicc's 64bit too:
>>>>
>>>>
>>>>
>>>>
>>>>>[sparkplug]~/<1>tmp > which mpicc
>>>>>/home/ndebard/local/ompi/bin/mpicc
>>>>>[sparkplug]~/<1>tmp > file /home/ndebard/local/ompi/bin/mpicc
>>>>>/home/ndebard/local/ompi/bin/mpicc: ELF 64-bit LSB executable, AMD
>>>>>x86-64, version 1 (SYSV), for GNU/Linux 2.4.1, dynamically linked
>>>>>(uses shared libs), not stripped
>>>>>
>>>>>
>>>>>
>>>>Someone suggested I take out the '--disable-shared' from the
>>>>configure line, so I did. The result was the same.
>>>>
>>>>So the result is that I cannot build a shared library on a 64bit
>>>>Linux machine that uses orte calls.
>>>>So then I tried taking out the orte calls and instead use MPI calls.
>>>>Sure, this function makes no sense but here it is now:
>>>>
>>>>
>>>>
>>>>
>>>>>#include "orte_config.h"
>>>>>#include <mpi.h>
>>>>>
>>>>>int test_compile(int x) {
>>>>> MPI_Comm_rank(MPI_COMM_WORLD, &x);
>>>>>
>>>>> return x + 1;
>>>>>}
>>>>>
>>>>>
>>>>>
>>>>And now, when I try and make a shared object I get relocation errors:
>>>>
>>>>
>>>>
>>>>
>>>>>/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
>>>>>/home/ndebard/local/ompi/lib/libmpi.a(comm_init.o): relocation
>>>>>R_X86_64_32 can not be used when making a shared object; recompile
>>>>>with -fPIC
>>>>>/home/ndebard/local/ompi/lib/libmpi.a: could not read symbols: Bad
>>>>>value
>>>>>
>>>>>
>>>>>
>>>>So... could the build perhaps be messed up and not really be using
>>>>64bit code?
>>>>Am I the only one seeing this? It's a trivial test for those of you
>>>>with access to a 64bit machine if you wouldn't mind testing for me.
>>>>
>>>>Help would be greatly appreciated.
>>>>
>>>>--
>>>>-- Nathan
>>>>Correspondence
>>>>---------------------------------------------------------------------
>>>>Nathan DeBardeleben, Ph.D.
>>>>Los Alamos National Laboratory
>>>>Parallel Tools Team
>>>>High Performance Computing Environments
>>>>phone: 505-667-3428
>>>>email: ndebard_at_[hidden]
>>>>---------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>
>