On Apr 24, 2010, at 10:14 PM, Nev wrote:
> void * const result = dlopen(libName, RTLD_LAZY | RTLD_LOCAL);
This line is the problem: change RTLD_LOCAL to RTLD_GLOBAL and it'll work. There's another option, too -- keep reading...
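For concreteness, here is a minimal sketch of that one-flag change wrapped in a helper. The name open_global() is made up for illustration, and the comment about dependencies describes the behavior I'd expect from the glibc run-time linker, not something I've checked on every platform:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: open a library with RTLD_GLOBAL so that its
 * symbols (and, on glibc at least, those of the libraries it depends
 * on, such as libmpi.so) become visible to DSOs that are dlopen()'ed
 * later. Returns NULL on failure and prints the linker's error. */
void *open_global(const char *lib_name)
{
    void *handle = dlopen(lib_name, RTLD_LAZY | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", lib_name, dlerror());
    }
    return handle;
}
```

In your code that would be something like "void * const result = open_global(libName);". On older glibc versions you may need to link with -ldl.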
<highly complex linker voodoo>
Before discussing why this happens, know that Open MPI plugins call functions back up in the main Open MPI libraries. As a crass-and-not-really-correct-but-close-enough example, consider that OMPI plugins are created (sorta) like this:
gcc my_plugin_source.c ... -L<dir> -lmpi --shared -o mca_framework_component.so
where libmpi.so is a shared library. These plugins are not making MPI standardized API function calls; they're calling internal functions inside libmpi.so (i.e., OMPI's internal implementation API). This is because libmpi.so (and friends) have a whole lotta infrastructure that the plugins need in order to be able to do their work.
It's a fun use of the intelligence of linkers -- a normal MPI app is linked against OMPI's libmpi.so, but so is mca_framework_component.so. When your app calls MPI_Init, the normal run-time linker semantics take over, resolve the symbol, and then call it. Later, mca_framework_component.so is dlopen()'ed. The run-time linker sees that it needs libmpi.so, but realizes that libmpi.so is already loaded -- so it doesn't load it again. When mca_framework_component.so calls OMPI_do_something(), the same run-time resolution occurs, and (this is key) it calls the function in the same instance of libmpi.so that your app is using.
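If you want to see the "already loaded, so don't load it again" behavior for yourself, a rough sketch like the following can help. The function name same_instance() is invented for this example; it simply opens the same library twice and checks that a symbol resolves to the same address through both handles:

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical check: open `lib` twice and compare where `sym` lives.
 * Returns 1 if both handles resolve `sym` to the same address (i.e.,
 * one shared instance), 0 if they differ or the symbol is missing,
 * and -1 if the library could not be opened at all. */
int same_instance(const char *lib, const char *sym)
{
    void *h1 = dlopen(lib, RTLD_LAZY | RTLD_LOCAL);
    if (h1 == NULL) {
        return -1;
    }
    void *h2 = dlopen(lib, RTLD_LAZY | RTLD_LOCAL);
    if (h2 == NULL) {
        dlclose(h1);
        return -1;
    }

    void *a1 = dlsym(h1, sym);
    void *a2 = dlsym(h2, sym);
    int same = (a1 != NULL && a1 == a2);

    /* Each successful dlopen() bumps a reference count; drop both. */
    dlclose(h2);
    dlclose(h1);
    return same;
}
```

Passing NULL as the library name asks for the main program itself, which is a handy way to try this without any particular .so on disk.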
Nifty. Without this concept, OMPI's plugin concept wouldn't work.
Your code is dlopen()ing liba2lib as LOCAL. The run-time linker pulls in libmpi.so at the same time as liba2lib (because MPI_Init needs it) -- and therefore libmpi.so is loaded into the same private space as liba2lib. But then later, the innards of Open MPI dlopen() mca_framework_component.so. This plugin is loaded into a DIFFERENT symbol space than libmpi.so. The key point here is that LOCAL is not "inherited", so to speak. If you dlopen() libfoo as LOCAL and libfoo then dlopen()s more DSOs, those newly-opened DSOs are in a different space than libfoo.
The best I can guess is that when mca_framework_component.so is dlopen()'ed, the linker says "ya, we have libmpi.so loaded" and it allows the load to complete successfully. But later when it tries to actually resolve OMPI_do_something(), it fails -- because OMPI_do_something() is in the private/LOCAL symbol space. And therefore OMPI_do_something has a value of 0. And it segv's when we try to call through it. (this paragraph may not be exactly right; but it's probably close -- every time I think I understand linkers, I find out that I don't understand them at all...)
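One way to poke at this guess is to ask the run-time linker whether a symbol is reachable from the global scope, which is roughly what a freshly dlopen()'ed plugin gets to search. The sketch below is only an illustration: visible_globally() is a made-up name, and it relies on the documented behavior that a handle from dlopen(NULL, ...) searches the main program, its startup dependencies, and anything opened with RTLD_GLOBAL:

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical probe: returns 1 if `symbol` can be found from the
 * global scope, 0 if it cannot (e.g., it lives only in a library
 * that was opened RTLD_LOCAL), and -1 if the probe itself failed. */
int visible_globally(const char *symbol)
{
    /* NULL means "the main program"; lookups through this handle
     * do not see libraries that were loaded as RTLD_LOCAL. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL) {
        return -1;
    }
    void *addr = dlsym(self, symbol);
    dlclose(self);
    return addr != NULL;
}
```

If you call this right after your dlopen() of liba2lib, I'd expect a libmpi symbol to come back as 0 with RTLD_LOCAL and 1 with RTLD_GLOBAL, although I haven't run that exact experiment.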
It works for you in the static case because Open MPI slurps up all the components *into* libmpi.so in that case. Hence, all the components *and* all the internal libmpi symbols are loaded into the same LOCAL symbol space. There's no dlopen'ing of plugins in this case. And it all works fine because everything can resolve nicely, yadda yadda yadda.
So I think your options are 1) change that LOCAL to GLOBAL, 2) configure Open MPI with "--enable-static --disable-shared", or 3) configure Open MPI with "--disable-dlopen". #2 builds libmpi.a *and* slurps all of OMPI's components up into libmpi.a. #3 builds libmpi.so *and* slurps all of OMPI's components up into libmpi.so. So with #3 you get the benefits of a shared library, but all the components are physically inside libmpi.so as opposed to being standalone DSOs.
</highly complex linker voodoo>
I hope that made sense!