Open MPI Development Mailing List Archives

From: Patrick Jessee (pj_at_[hidden])
Date: 2006-12-08 08:53:00


Hello. For OpenMPI 1.1.2, I've come across a situation where the
--prefix syntax does not seem to be working. I've investigated the
issue by stepping through the mpirun startup in a debugger. Below is a
summary of the problem and details about the investigation (along with a
prospective fix).

Summary of problem
==================

When starting an openMPI run with the --prefix option, the MPI
application does not start up correctly in certain situations. An
important point is that this problem behavior is masked (and not seen)
if the openMPI libraries are available at the compile/install-time
location defined by OPAL_PKGLIBDIR (defined in
opal/include/opal/install_dirs.h). So in debugging the problem, it is
important to move the openMPI installation from the installed location,
and then set the --prefix value to the new location. In addition,
LD_LIBRARY_PATH needs to be set to the new location so mpirun can find
liborte.so and libopal.so at program load time. (--prefix can't help
mpirun with liborte.so and libopal.so because (a) these libs are
dynamically linked into mpirun and are needed at program load time, and
(b) the --prefix arg isn't processed until after load time. Thus
LD_LIBRARY_PATH is needed for mpirun, but this is tangential.)

The behavior that is seen is the following output:

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_sds_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS
:
:
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occurred while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Investigation of the problem
============================

As mentioned before, I've looked at mpirun in the debugger. The
instances of mpirun and the MPI app find the dynamically linked
libraries (liborte.so, libopal.so) just fine, but they do not locate the
dynamically loaded ones (the ones in lib/openmpi such as
mca_paffinity_linux.so, etc.). The --prefix directory does not seem to
be getting used to open the libraries in lib/openmpi.

It appears that the location to search is getting set in mca_base_open.c
around line 68 (1.1.2):

asprintf(&value, "%s:~/.openmpi/components", OPAL_PKGLIBDIR);
mca_base_param_component_path =
  mca_base_param_reg_string_name("mca", "component_path",
                                 "Path where to look for Open MPI and ORTE components",
                                 false, false, value, NULL);

Here, OPAL_PKGLIBDIR is a fixed, compile-time location. It appears that
the --prefix directory (actually <prefix_dir>/lib/openmpi) needs to be
appended, if not prepended, to the component_path. Alternatively, the
static OPAL_PKGLIBDIR directory could just be replaced by the runtime
value of <prefix_dir>/lib/openmpi.
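
The prepend variant described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the Open MPI tree: build_component_path() is a hypothetical helper, the prefix is assumed to be already available as a plain string (how it would reach mca_base_open.c is not addressed here), and the compile-time directory is passed in as an argument rather than read from OPAL_PKGLIBDIR.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of Open MPI): build a component search
 * path that lists <prefix>/lib/openmpi first, then the compile-time
 * directory, then the per-user directory. If prefix is NULL or empty,
 * the result matches the existing 1.1.2 behavior.
 * Returns a malloc'd string that the caller must free, or NULL on error.
 */
char *build_component_path(const char *prefix, const char *pkglibdir)
{
    const char *user_dir = "~/.openmpi/components";
    int have_prefix = (prefix != NULL && prefix[0] != '\0');
    int len;
    char *path;

    /* First pass: measure how many characters the path needs. */
    if (have_prefix) {
        len = snprintf(NULL, 0, "%s/lib/openmpi:%s:%s",
                       prefix, pkglibdir, user_dir);
    } else {
        len = snprintf(NULL, 0, "%s:%s", pkglibdir, user_dir);
    }
    if (len < 0) {
        return NULL;
    }

    path = malloc((size_t)len + 1);
    if (path == NULL) {
        return NULL;
    }

    /* Second pass: write the path into the allocated buffer. */
    if (have_prefix) {
        snprintf(path, (size_t)len + 1, "%s/lib/openmpi:%s:%s",
                 prefix, pkglibdir, user_dir);
    } else {
        snprintf(path, (size_t)len + 1, "%s:%s", pkglibdir, user_dir);
    }
    return path;
}
```

Putting the prefix directory first means a relocated installation is searched before the (possibly stale or missing) compile-time location, while the compile-time location remains as a fallback.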

I've compiled in a quick fix to libopal.so to see if the approach
addressed the issue. I didn't see how to get access to the --prefix
directory at this point, so I just prepended getenv("LD_LIBRARY_PATH")
to "value" and added <prefix_dir>/lib/openmpi to LD_LIBRARY_PATH before
starting the app (note: this is just a way for verifying that if the
--prefix directory was used here, it would address the issue; this is
not a proposed solution. The <prefix_dir>/lib/openmpi should be used
directly). Anyway, this fixed the issue and the application was able to
start.
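
For reference, the verification hack amounts to something like the sketch below. It is a standalone approximation, not the actual patch: prepend_env_path() is a hypothetical name, and the real change was made inline on "value" in mca_base_open.c.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of Open MPI): return
 * "<contents of env_name>:<value>", or a copy of value alone when the
 * environment variable is unset or empty.
 * Returns a malloc'd string that the caller must free, or NULL on error.
 */
char *prepend_env_path(const char *env_name, const char *value)
{
    const char *env = getenv(env_name);
    int have_env = (env != NULL && env[0] != '\0');
    int len;
    char *out;

    /* Measure first, then allocate and format. */
    if (have_env) {
        len = snprintf(NULL, 0, "%s:%s", env, value);
    } else {
        len = snprintf(NULL, 0, "%s", value);
    }
    if (len < 0) {
        return NULL;
    }

    out = malloc((size_t)len + 1);
    if (out == NULL) {
        return NULL;
    }

    if (have_env) {
        snprintf(out, (size_t)len + 1, "%s:%s", env, value);
    } else {
        snprintf(out, (size_t)len + 1, "%s", value);
    }
    return out;
}
```

With LD_LIBRARY_PATH containing <prefix_dir>/lib/openmpi, calling prepend_env_path("LD_LIBRARY_PATH", value) yields a search path in which the relocated component directory comes before the compile-time one.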

In applying this fix, I also found that it was not only important for
mca_base_param_component_path to include the <prefix_dir>/lib/openmpi
directory in the instances of mpirun and the MPI app, but also in all
instances of orted before they dynamically load libraries.

----
In summary, it seems that this issue can be resolved by applying the
--prefix directory (<prefix_dir>/lib/openmpi) to
mca_base_param_component_path in instances of mpirun, orted, and the MPI
app.

Any help in getting this fix implemented in the code base would be very
much appreciated, and I'll be happy to provide any more information or
help.

Regards,
Patrick

P.S. Even with the fix, a (non-fatal) message is printed. It's
probably a tangential issue, but thought it was worth mentioning. Again,
the --prefix directory probably needs to be used somewhere in place of a
static directory. The message is:

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
  rds:no-hostfile
from the file:
  help-rds-hostfile.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------