Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-15 08:12:49


Sorry for the delay in replying.

I think that the issue here is the well-known libltdl "reporting the wrong error message" issue.

Specifically, sometimes libltdl fails to load a DSO for a good reason, but then fails to report the right reason why the load failed. Open MPI uses the function lt_dlerror() to get a printable string explaining why a DSO failed to load. But sometimes that string is *wrong* (i.e., the DSO didn't load, but the reason OMPI printed out as to *why* it didn't load is incorrect). And therefore what OMPI prints out is misleading, at best.

Over time, we have tried two things to make this error message better:

1. When we detect the "wrong" error message (i.e., if lt_dlerror() returns "file not found"), we actually use stat() to check for the presence of the file we were trying to open. If we find the file, then we don't print the lt_dlerror(), but instead print the message you see:

[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or
compiled for a different version of Open MPI? (ignored)

So the error message is at least *somewhat* better than a totally misleading "file not found" message -- but it still only speculates on the real reason that libltdl failed to load the DSO.
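The check described in item 1 can be sketched roughly as follows. This is only an illustration, not the actual code in mca_base_component_find.c: the helper name choose_load_error() and its signature are made up for this example, and the error string is passed in as a parameter (in OMPI it would come from lt_dlerror()).

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper (not Open MPI's real code).
 *   err  : reason string reported by the loader, e.g. from lt_dlerror(); may be NULL
 *   path : the file we tried to open
 * Writes the message to print into buf and returns 1 if the loader's
 * "file not found" was replaced by the speculative message, 0 otherwise. */
static int choose_load_error(const char *err, const char *path,
                             char *buf, size_t len)
{
    struct stat st;

    /* The loader claims the file is missing, but stat() can see it:
     * the reported reason must be wrong, so print a guess instead. */
    if (err != NULL && strcmp(err, "file not found") == 0 &&
        stat(path, &st) == 0) {
        snprintf(buf, len,
                 "unable to open %s: perhaps a missing symbol, or compiled "
                 "for a different version of Open MPI? (ignored)", path);
        return 1;
    }

    /* Otherwise pass the loader's own reason through unchanged. */
    snprintf(buf, len, "unable to open %s: %s (ignored)",
             path, err != NULL ? err : "unknown error");
    return 0;
}
```

Note that the replacement message is still only a guess; the sketch just avoids printing a reason that is known to be false.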

2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an OMPI-specific change to libltdl that avoids the incorrect error message altogether. So now OMPI should print out the *real* reason libltdl failed to load the DSO.

It does not look like this patch made it over into the v1.4 series; it is awaiting review before it moves to the v1.5 branch (https://svn.open-mpi.org/trac/ompi/ticket/2337).

Hope that all made sense!

-----

Now, all this being said, IIRC (and I very well may not!), the real underlying issue here is that R is dlopening libmpi.so, which, in turn, is dlopening its own DSOs. Given the global linker scoping issues, OMPI's DSOs are unable to find the symbols they need to resolve in the process (because libmpi.so was opened in a private scope).

Unfortunately, this is probably larger than us (Open MPI) -- it's really a POSIX issue. What would be ideal is if linker namespaces within a process could be more fine-grained than "global" or "private". E.g., if libmpi.so could selectively make its private symbol namespace available to the DSOs that it dlopens. Right now, the only option is for libmpi.so to be opened with a public scope, which somewhat defeats the point of private scoping.
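For reference, the two scopes correspond to the RTLD_LOCAL and RTLD_GLOBAL flags of POSIX dlopen(). A minimal sketch of how a host application could choose between them is below; scope_flags() and open_library() are hypothetical helper names for this example, and I am not claiming this is how R actually opens libmpi.so.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: build dlopen() flags for the requested scope.
 * RTLD_GLOBAL makes the library's symbols available for resolving
 * references in libraries loaded later; RTLD_LOCAL keeps them private. */
static int scope_flags(int make_global)
{
    return RTLD_NOW | (make_global ? RTLD_GLOBAL : RTLD_LOCAL);
}

/* Hypothetical helper: open a library with the chosen scope.
 * path may be NULL, in which case dlopen() returns a handle for the
 * main program. Returns NULL on failure; call dlerror() for the reason. */
static void *open_library(const char *path, int make_global)
{
    dlerror(); /* clear any stale error state before the call */
    return dlopen(path, scope_flags(make_global));
}
```

A host would call something like open_library("libmpi.so", 1) to get the public scope (the library name here is just an example). On older glibc versions this needs to be linked with -ldl.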

Have you tried building Open MPI with the --disable-dlopen configure flag? This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no dlopening at run-time. Hence, your app (R) can dlopen libmpi.so, but then libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are physically located in libmpi.so.

On May 11, 2010, at 8:33 PM, <Kevin.Buckley_at_[hidden]> <Kevin.Buckley_at_[hidden]> wrote:

>
> > Which libltdl version is that NetBSD ltdl.h from? Which version is
> > in opal/libltdl? Have you tried not doing the above change?
> >
> > libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> > as well as in the header, as well as (I think) in preloaded modules.
>
> Hey Ralf,
>
> The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.
>
> An ldd of mpirun shows -lltdl.7 => /usr/pkg/lib/libltdl.so.7
>
>
> I do need to attempt a build of 1.4.2 here in ECS, so I'll try
> building without the patches but I seem to recall that if those
> libtool-related patches
>
> opal/Makefile.in
> configure
> opal/mca/base/mca_base_component_find.c
> opal/mca/base/mca_base_component_repository.c
> test/support/components.h
> test/support/components.c
>
> were not applied, it did not even build. But we'll see.
>
>
> And if you are reading this, Alexsej, have you, as the real
> "OpenMPI on NetBSD" man, built a 1.4.2 as yet?
>
> Kevin
>
> --
> Kevin M. Buckley Room: CO327
> School of Engineering and Phone: +64 4 463 5971
> Computer Science
> Victoria University of Wellington
> New Zealand
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/