Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-09-15 15:32:21


Followup for the list... a bit of explanation of Nathan's problem about
shared libraries and unresolved symbols.

Short version:
--------------

It's an OMPI bug when built as a shared library (not an issue for
static libraries). The fix is straightforward, but involves grunt
work. I'll try to get a student to do it RSN.

Long version:
-------------

What's happening is that we are not linking OMPI components against the
opal/orte/ompi libraries. As such, we are exploiting the fact that
when they are dlopened by a standalone application (e.g., a.out), the
Libtool portable version of dlopen() exports all the symbols from the
parent process such that the child can find and use them at run-time to
resolve any unknown symbols. Here's an example (I'm leaving out some
fine-grained details, and it's slightly different on different OS's,
but this is "true enough" for the purposes of this thread):

- a.out, which was linked against libopal.so (and friends), launches
- the linker runs into an unresolved symbol
- the linker sees that the symbol is supposed to be in "libopal.so",
and starts searching LD_LIBRARY_PATH for it
- the linker finds libopal.so, loads it, and is able to resolve the
symbol

It gets interesting at this part:

- within MPI_Init()/orte_init()/opal_init() (i.e., however you
initialized yourself to OMPI/ORTE/OPAL), we use the Libtool portable
dlopen() to open our components
- the components will have unresolved symbols as well (i.e., the
symbols in libopal, liborte, and libmpi)
- when the linker hits these, it tries to resolve them.
- first, the linker looks in the public namespace of the process, and
if it finds the symbols there, it's done
- in this case, libopal (and friends) have already been loaded in the
process, so the linker can find the symbols right away -- without
loading any additional libraries

This is the scheme that we were relying on for libopal/orte/ompi
symbols to be resolved in our components. And for standalone
executables, it works fine.
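
To make the "public namespace" idea concrete, here is a minimal sketch in C
of how a program can ask whether a symbol is currently visible in the
process-wide scope. It uses only the POSIX dlopen()/dlsym() calls, not the
Libtool wrapper that OMPI uses. The helper name visible_in_process() is
made up for this illustration. On older glibc versions you may need to link
with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of OMPI): returns 1 if `symbol` can be
 * found in the process's global symbol scope, 0 otherwise.
 *
 * dlopen(NULL, ...) yields a handle for the main program. dlsym() on that
 * handle searches the executable, the libraries loaded at startup, and
 * anything that was later opened with RTLD_GLOBAL.
 *
 * A symbol whose address is genuinely NULL is treated as "not found",
 * which is good enough for a sketch.
 */
int visible_in_process(const char *symbol)
{
    assert(symbol != NULL);

    void *self = dlopen(NULL, RTLD_NOW);
    if (self == NULL) {
        return 0;
    }

    void *addr = dlsym(self, symbol);
    dlclose(self);
    return addr != NULL;
}
```

In a standalone a.out that was linked against libopal.so, a call such as
visible_in_process("mca_base_param_reg_int") would be expected to return 1,
which is the situation our components were silently depending on.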

But for an environment like Eclipse, it doesn't.

I don't know anything about Eclipse, but I'm assuming that it does
something similar to our component system -- it dlopen's them. However
-- here's where my guess comes in -- it doesn't make all the symbols in
the opened component be in the public namespace of the process (this is
different than what OMPI does, for various reasons). Hence, if you
build an Eclipse component against OMPI, the Eclipse component will be
dynamically linked against libopal (etc.). So when Eclipse loads in
your component, similar to the standalone executable example above, the
linker will realize that it has unresolved symbols and will use the
normal mechanism to resolve them (e.g., look for libopal.so in
LD_LIBRARY_PATH).

The problem comes in when we dlopen OMPI/ORTE/OPAL components.

Our scheme assumed that we'd be able to find the opal/orte/ompi symbols
in the public namespace of the parent process. But they're not --
Eclipse loaded the component in a private namespace, and therefore all
the opal/orte/ompi symbols are in that private namespace. And
therefore the OMPI/ORTE/OPAL components can't find the symbols, and the
linker barfs.
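
For reference, the public/private distinction corresponds to the
RTLD_GLOBAL and RTLD_LOCAL flags of the plain POSIX dlopen(). The sketch
below is only an illustration of those two modes. Whether Eclipse (or the
JVM underneath it) really opens plugins this way is my guess, as noted
above, and open_plugin() and lookup_via_handle() are made-up names.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper: open a shared object either publicly or privately.
 *
 * RTLD_GLOBAL: symbols defined by the object (and by the dependencies it
 *              pulls in) become available for resolving references in
 *              objects that are loaded later.
 * RTLD_LOCAL:  those symbols stay private; later dlopen()ed objects cannot
 *              use them to satisfy their own undefined references.
 *
 * Returns the dlopen() handle, or NULL if the object could not be loaded.
 */
void *open_plugin(const char *path, int public_symbols)
{
    assert(path != NULL);

    int flags = RTLD_NOW | (public_symbols ? RTLD_GLOBAL : RTLD_LOCAL);
    return dlopen(path, flags);
}

/*
 * Looking a symbol up through the handle itself works in both modes; the
 * flag only controls what *other* objects are allowed to see.
 */
void *lookup_via_handle(void *handle, const char *symbol)
{
    assert(handle != NULL && symbol != NULL);
    return dlsym(handle, symbol);
}
```

If a host opens libptp_ompi_jni.so the private way, libopal.so gets loaded
as its dependency but stays private too, so a component that is opened
afterwards finds nothing to resolve mca_base_param_reg_int against.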

The solution is to change our scheme in OMPI a bit. We just need to
add a few lines to all the component Makefile.am's to, in the dynamic
case, link the components against their relevant libraries (opal
components linked against libopal, orte components linked against
liborte and libopal, etc.). This does not make the components
significantly larger -- it just adds an entry into the dynamic linker
section of the component's resulting .so file indicating "if you have
unresolved symbols, go look in libopal.so" (etc.).

This allows the components themselves to pull in shared libraries when
they are dlopened -- if they need to. If the symbols can be resolved
in the parent process' public symbol namespace, they still will be (as
in the standalone executable example, above). But if they can't be
resolved that way, this gives the ability to explicitly find and pull
in a shared library and resolve the symbols that way (as in the Eclipse
plugin example, above).
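
The lookup order described above can be written down as a toy model in C.
This is emphatically not how a real dynamic linker is implemented, and it
is "true enough" in the same sense as the rest of this mail: first search
the process's public scope, then fall back to the libraries the component
itself was linked against. All type and variable names below are invented
for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a shared library: a name plus the symbols it defines. */
typedef struct library {
    const char *name;
    const char *const *symbols;       /* NULL-terminated */
} library;

/* Toy model of a component: a name plus the libraries it was linked
 * against (standing in for the "go look in libopal.so" entries). */
typedef struct component {
    const char *name;
    const library *const *needed;     /* NULL-terminated */
} component;

static int defines(const library *lib, const char *symbol)
{
    for (size_t i = 0; lib->symbols[i] != NULL; i++) {
        if (strcmp(lib->symbols[i], symbol) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Returns the name of the library that satisfies `symbol`, or NULL if the
 * symbol stays unresolved. Public scope first, then the component's own
 * linked libraries. */
const char *resolve(const library *const *public_scope,
                    const component *comp, const char *symbol)
{
    assert(public_scope != NULL && comp != NULL && symbol != NULL);

    for (size_t i = 0; public_scope[i] != NULL; i++) {
        if (defines(public_scope[i], symbol)) {
            return public_scope[i]->name;
        }
    }
    for (size_t i = 0; comp->needed[i] != NULL; i++) {
        if (defines(comp->needed[i], symbol)) {
            return comp->needed[i]->name;
        }
    }
    return NULL;
}

/* Scenario data. */
const char *const opal_symbols[] = { "mca_base_param_reg_int", NULL };
const library libopal = { "libopal.so", opal_symbols };

/* Standalone a.out: libopal is in the public scope. */
const library *const scope_standalone[] = { &libopal, NULL };
/* Plugin host: libopal was loaded privately, so the public scope is empty. */
const library *const scope_private_host[] = { NULL };

const library *const needed_none[] = { NULL };
const library *const needed_opal[] = { &libopal, NULL };

/* Component as built today (not linked against libopal) ... */
const component paffinity_unlinked = { "mca_paffinity_linux.so", needed_none };
/* ... and as built with the proposed change. */
const component paffinity_linked = { "mca_paffinity_linux.so", needed_opal };
```

In this model, resolve(scope_private_host, &paffinity_unlinked, ...) comes
back NULL for mca_base_param_reg_int, which is Nathan's error, while the
same call with &paffinity_linked finds libopal.so. The standalone case
resolves either way.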

Aren't computers fun? :-)

On Sep 14, 2005, at 12:47 PM, Nathan DeBardeleben wrote:

> Let me explain what I'm doing real quickly.
>
> I have a piece of Java code which is calling OMPI calls. It's doing
> this through JNI (Java Native Interface). Don't worry, you don't have
> to understand Java to try and help me here. The JNI code is C with
> some funky macros in it provided by Java.
>
> I have to compile the JNI C code into a shared library and then the
> Java code will load it dynamically when that class is instantiated.
>
> So - here's my compile line:
>
>> [sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I
>> /usr/java/jdk1.5.0_04/include/linux -c ptp_ompi_jni.c -fPIC
>> [sparkplug]~/<2>ompi > mpicc -I
>> /usr/java/jdk1.5.0_04/include -I /usr/java/jdk1.5.0_04/include/linux
>> -shared -o libptp_ompi_jni.so ptp_ompi_jni.o
>
> I then have libptp_ompi_jni.so. I then load that from within Java.
> If I set up my LD_LIBRARY_PATH and some args to the Java VM correctly,
> then it finds the above library and loads it up. OK - all fine so
> far.
>
> However, when I call 'orte_init()' it craps out with the following
> error:
>
>> /usr/java/jdk1.5.0_04/bin/java: error while loading shared libraries:
>> /home/ndebard/local/ompi/lib/openmpi/mca_paffinity_linux.so:
>> undefined symbol: mca_base_param_reg_int
>
> So I went digging in mca_paffinity_linux.so looking for that symbol.
>
>> [sparkplug]~/<3>openmpi > nm mca_paffinity_linux.so | grep
>> mca_base_param_reg
>> U mca_base_param_reg_int
>> [sparkplug]~/<3>openmpi >
>
> OK. So it's undefined in that .so.
> I'm really not a library guy (can't you tell from my myriad of
> mails?). What does this mean? I went back digging in the parent
> directory, /home/ndebard/local/ompi/lib, to find the symbol.
>
>> [sparkplug]~/<2>lib > nm libopal.so | grep mca_base_param_reg_int
>> 000000000001ce00 T mca_base_param_reg_int
>> 000000000001cea3 T mca_base_param_reg_int_name
>> [sparkplug]~/<2>lib >
>
> OK so I read this as it's defined in libopal.so.
> Do you have any idea why my JNI library is trying to load
> mca_paffinity_linux.so?
> I went back to my compile line and added -lopal -lmpi -lorte just in
> case, but that didn't help.
>
> Again, Jeff, I know this isn't really your concern (unless you want a
> wicked OMPI graphical demo at SC!) :) but I wanted to drop it out
> there in case you had any insight. I'm kinda stumped on this one.
>
> Does it mean my ompi compile is bad?
>
> -- Nathan
> Correspondence
> ---------------------------------------------------------------------
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> Parallel Tools Team
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndebard_at_[hidden]
> ---------------------------------------------------------------------
>
>
>
> Jeff Squyres wrote:
>
>> Maybe I'm dense -- I thought you couldn't use --shared when linking
>> to a static library...?
>>
>> If you want to build OMPI as a shared library, then ditch the
>> --enable-static --disable-shared from your configure line (building
>> OMPI as shared is the default, which is how I build 95% of the time).
>>
>>
>>
>> On Sep 12, 2005, at 5:47 PM, Nathan DeBardeleben wrote:
>>
>>
>>> I've been having this problem for a week or so and I've been asking
>>> other people to weigh in if they know what I'm doing wrong. I've
>>> gotten nowhere on this so I figure I'll finally drop it out on the
>>> list.
>>> First, here's the important info:
>>> The machine:
>>>
>>>
>>>> [sparkplug]~ > cat /etc/issue
>>>>
>>>> Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).
>>>>
>>>>
>>>> [sparkplug]~ > uname -a
>>>> Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64
>>>> x86_64 x86_64 GNU/Linux
>>>>
>>> My versions of libtool, autoconf, automake:
>>>
>>>
>>>> [sparkplug]~ > libtool --version
>>>> ltmain.sh (GNU libtool) 1.5.20 (1.1220.2.287 2005/08/31 18:54:15)
>>>>
>>>> Copyright (C) 2005 Free Software Foundation, Inc.
>>>> This is free software; see the source for copying conditions.
>>>> There is NO
>>>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>>> PURPOSE.
>>>> [sparkplug]~ > autoconf --version
>>>> autoconf (GNU Autoconf) 2.59
>>>> Written by David J. MacKenzie and Akim Demaille.
>>>>
>>>> Copyright (C) 2003 Free Software Foundation, Inc.
>>>> This is free software; see the source for copying conditions.
>>>> There is NO
>>>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>>> PURPOSE.
>>>> [sparkplug]~ > automake --version
>>>> automake (GNU automake) 1.8.5
>>>> Written by Tom Tromey <tromey_at_[hidden]>.
>>>>
>>>> Copyright 2004 Free Software Foundation, Inc.
>>>> This is free software; see the source for copying conditions.
>>>> There is NO
>>>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>>> PURPOSE.
>>>> [sparkplug]~ >
>>>>
>>> My ompi version: 7322 - but this has been going on for a few days
>>> like I said and I've been updating a lot, with no progress.
>>>
>>> Configured using:
>>>
>>>
>>>> $ ./configure --enable-static --disable-shared --without-threads
>>>> --prefix=/home/ndebard/local/ompi --with-devel-headers
>>>> --enable-mca-no-build=ptl-gm
>>>>
>>> Simple C file which I will compile into a shared library:
>>>
>>>
>>>> int test_compile(int x) {
>>>> int rc;
>>>>
>>>> rc = orte_init(true);
>>>> printf("rc = %d\n", rc);
>>>>
>>>> return x + 1;
>>>> }
>>>>
>>> Above file is named 'testlib.c'
>>>
>>> OK, so let's build this:
>>>
>>>
>>>> [sparkplug]~/ompi-test > mpicc -c testlib.c
>>>> [sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
>>>> /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
>>>> testlib.o: relocation R_X86_64_32 can not be used when making a
>>>> shared object; recompile with -fPIC
>>>> testlib.o: could not read symbols: Bad value
>>>> collect2: ld returned 1 exit status
>>>>
>>> OK so relocation problems. Maybe I'll follow the directions and
>>> -fPIC my file myself:
>>>
>>>
>>>> [sparkplug]~/ompi-test > mpicc -c testlib.c -fPIC
>>>> [sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
>>>> /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
>>>> /home/ndebard/local/ompi/lib/liborte.a(orte_init.o): relocation
>>>> R_X86_64_32 can not be used when making a shared object; recompile
>>>> with -fPIC
>>>> /home/ndebard/local/ompi/lib/liborte.a: could not read symbols: Bad
>>>> value
>>>> collect2: ld returned 1 exit status
>>>>
>>> OK so I read this as there's a relocation problem in 'liborte.a'. I
>>> un-arred liborte.a and checked some of the files with 'file' and it
>>> says 64bit. I haven't yet written a script to check every file in
>>> here, but here's orte_init.o:
>>>
>>>
>>>> [sparkplug]~/<1>tmp > file orte_init.o
>>>> orte_init.o: ELF 64-bit LSB relocatable, AMD x86-64, version 1
>>>> (SYSV), not stripped
>>>>
>>> So that at least says it's 64bit.
>>> And to confirm, my mpicc's 64bit too:
>>>
>>>
>>>> [sparkplug]~/<1>tmp > which mpicc
>>>> /home/ndebard/local/ompi/bin/mpicc
>>>> [sparkplug]~/<1>tmp > file /home/ndebard/local/ompi/bin/mpicc
>>>> /home/ndebard/local/ompi/bin/mpicc: ELF 64-bit LSB executable, AMD
>>>> x86-64, version 1 (SYSV), for GNU/Linux 2.4.1, dynamically linked
>>>> (uses shared libs), not stripped
>>>>
>>> Someone suggested I take out the '--disable-shared' from the configure
>>> line, so I did. The result was the same.
>>>
>>> So the result is that I can not build a shared library on a 64bit
>>> linux machine that uses orte calls.
>>> So then I tried taking out the orte calls and instead use MPI calls.
>>> Sure, this function makes no sense but here it is now:
>>>
>>>
>>>> #include "orte_config.h"
>>>> #include <mpi.h>
>>>>
>>>> int test_compile(int x) {
>>>> MPI_Comm_rank(MPI_COMM_WORLD, &x);
>>>>
>>>> return x + 1;
>>>> }
>>>>
>>> And now, when I try and make a shared object I get relocation errors:
>>>
>>>
>>>> /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
>>>> /home/ndebard/local/ompi/lib/libmpi.a(comm_init.o): relocation
>>>> R_X86_64_32 can not be used when making a shared object; recompile
>>>> with -fPIC
>>>> /home/ndebard/local/ompi/lib/libmpi.a: could not read symbols: Bad
>>>> value
>>>>
>>> So... could perhaps the build be messed up and not be really using
>>> 64bit code?
>>> Am I the only one seeing this? It's a trivial test for those of you
>>> with access to a 64bit machine if you wouldn't mind testing for me.
>>>
>>> Help would be greatly appreciated.
>>>
>>> --
>>> -- Nathan
>>> Correspondence
>>> ---------------------------------------------------------------------
>>> Nathan DeBardeleben, Ph.D.
>>> Los Alamos National Laboratory
>>> Parallel Tools Team
>>> High Performance Computing Environments
>>> phone: 505-667-3428
>>> email: ndebard_at_[hidden]
>>> ---------------------------------------------------------------------

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/