Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Missing Symbol
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-03-05 18:26:13


On Mar 5, 2010, at 6:02 PM, Jeff Squyres (jsquyres) wrote:

> I wondered aloud on IM to Terry after your earlier emails if we should just custom-patch ltdl in OMPI to fix this issue. The problem is that libltdl is effectively reporting the "wrong" error back to OMPI, so the error string that we get to print out ends up not being very useful (e.g., not showing which symbol was missing, or what the problem was with the dlopen). Fixing this properly in libltdl is actually somewhat tricky -- which is why it hasn't been fixed yet. But given that OMPI's use of libltdl is pretty specific, we might be able to get away with a simple fix that works just for OMPI (but wouldn't necessarily be suitable for all other libltdl users).

I made a patch for exactly what I described: it comments out the preopen module's setting of FILE_NOT_FOUND. But now I'm getting foiled by the use of RTLD_LAZY. For example, if I add a reference to a bogus, unresolvable symbol in the TCP BTL, I get this when I run ompi_info:

-----
...lots of ompi_info config output...
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
dyld: lazy symbol binding failed: Symbol not found: _jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
[ ompi_info aborts ]
-----

This is happening because libltdl invokes dlopen() with RTLD_LAZY, so a symbol that can't be resolved is not treated as an error at dlopen() time. It becomes a fatal problem later, when the component's open function is invoked and my unresolved symbol is exposed in all of its glory.
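
To make the two binding modes concrete, here is a minimal sketch of opening a shared object with a caller-chosen mode and capturing the dlerror() text. try_dlopen() is a hypothetical helper written for this illustration; it is not part of OMPI or libltdl, and it only uses the standard dlopen(), dlerror(), and dlclose() calls.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not OMPI or libltdl code): open `path` with the
 * given binding mode (RTLD_LAZY or RTLD_NOW). On failure, copy the
 * dlerror() message into errbuf and return -1; on success, return 0. */
int try_dlopen(const char *path, int mode, char *errbuf, size_t errlen)
{
    void *handle;
    const char *msg;

    dlerror();                      /* clear any stale error state */
    handle = dlopen(path, mode);
    if (handle == NULL) {
        msg = dlerror();            /* message describing the failed dlopen */
        snprintf(errbuf, errlen, "%s",
                 msg != NULL ? msg : "unknown dlopen error");
        return -1;
    }

    if (errlen > 0) {
        errbuf[0] = '\0';           /* no error to report */
    }
    dlclose(handle);
    return 0;
}
```

Assuming a component like the mca_btl_tcp.so above that references a function symbol nobody provides, I would expect try_dlopen(path, RTLD_LAZY, ...) to return 0 (with the failure only surfacing at the first call through that symbol), and try_dlopen(path, RTLD_NOW, ...) to return -1 with the "Symbol not found" text in errbuf. On some Linux systems this may need to be linked with -ldl.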

If I manually change LT_LAZY_OR_NOW to RTLD_NOW in libltdl/loaders/dlopen.c, then I get the behavior I was expecting:

-----
...lots of ompi_info config output...
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
[rtp-jsquyres-8717.cisco.com:89384] mca: base: component_find: unable to open /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp: dlopen(/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so, 10): Symbol not found: _jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
 in /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so (ignored)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.7)
           MCA paffinity: darwin (MCA v2.0, API v2.0, Component v1.7)
...lots of ompi_info config output...
-----

I.e., the dlopen() fails and my patch causes us to actually get a reasonable error message from libltdl.

So:

1. Given that I'm seeing this on both Linux (RHEL4) and OSX, LT_LAZY_OR_NOW must be resolving to RTLD_LAZY on both Linux and OSX -- so how are you getting the error message that you're getting? Is your system somehow using RTLD_NOW?

2. If OSX and Linux both use RTLD_LAZY, is my patch useful? I'm hesitant to add it if it's only partially (or not at all) useful...
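
For question 1, a quick way to check what a given platform would pick is a sketch along these lines. It assumes the flag is selected by a preprocessor fallback that prefers RTLD_LAZY whenever the system headers define it; SKETCH_LAZY_OR_NOW and sketch_mode_is_lazy() are made-up names for illustration and are not libltdl's actual code.

```c
#include <dlfcn.h>

/* Hypothetical stand-in for a LAZY_OR_NOW-style fallback chain. The macro
 * name and the order of preference are assumptions for illustration. */
#if defined(RTLD_LAZY)
#  define SKETCH_LAZY_OR_NOW RTLD_LAZY
#elif defined(RTLD_NOW)
#  define SKETCH_LAZY_OR_NOW RTLD_NOW
#else
#  define SKETCH_LAZY_OR_NOW 0
#endif

/* Returns 1 if the fallback chain above selected lazy binding on this
 * platform, 0 if it selected anything else. */
int sketch_mode_is_lazy(void)
{
#if defined(RTLD_LAZY)
    return SKETCH_LAZY_OR_NOW == RTLD_LAZY;
#else
    return 0;
#endif
}
```

If this returns 1 on your system as well, then a chain like this would pass RTLD_LAZY there too, and the error message you are seeing would have to be coming from somewhere else.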

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/