
Open MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-12-15 19:16:36


Greetings Patrick. Many thanks for the detailed run-down; sorry I
didn't reply earlier.

This is quite definitely a known problem, and I'm pretty sure we have
an open ticket on it (I'm on a plane right now and can't check the
web-based bug tracker). We have a solution in mind for the issue,
but it hasn't been done yet, mainly because it hasn't bubbled up high
enough in priority and no one has had the time to code it up.

How high of a priority is the ability to re-home an OMPI installation
for you?

On Dec 8, 2006, at 8:53 AM, Patrick Jessee wrote:

>
> Hello. For Open MPI 1.1.2, I've come across a situation where the
> --prefix syntax does not seem to be working. I've investigated the
> issue by stepping through the mpirun startup in a debugger. Below
> is a summary of the problem and details about the investigation
> (along with a prospective fix).
>
> Summary of problem
> ===============
>
> When starting an Open MPI run with the --prefix option, the MPI
> application does not start up correctly in certain situations. An
> important point is that this problem is masked (and not seen) if
> the Open MPI libraries are available at the compile/install-time
> location defined by OPAL_PKGLIBDIR (defined in
> opal/include/opal/install_dirs.h). So in debugging the problem, it
> is important to move the Open MPI installation away from its
> installed location and then set the --prefix value to the new
> location. In addition, LD_LIBRARY_PATH needs to be set to the new
> location so mpirun can find liborte.so and libopal.so at program
> load time (--prefix can't help mpirun with liborte.so and
> libopal.so because (a) these libs are dynamically linked into
> mpirun and are needed at program load time, and (b) the --prefix
> arg isn't processed until after load time). Thus LD_LIBRARY_PATH
> is needed for mpirun, but this is tangential.
>
> The behavior that is seen is the following output:
>
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_sds_base_select failed
> --> Returned value -13 instead of ORTE_SUCCESS
> :
> :
> --------------------------------------------------------------------------
> Open RTE was unable to initialize properly. The error occurred while
> attempting to orte_init(). Returned value -13 instead of
> ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
>
> Investigation of the problem
> ===================
>
> As mentioned before, I've looked at mpirun in the debugger. Both
> mpirun and the MPI app find the dynamically linked libraries
> (liborte.so, libopal.so) just fine, but they do not locate the
> dynamically loaded ones (the ones in lib/openmpi, such as
> mca_paffinity_linux.so). The --prefix directory does not seem to
> be getting used to open the libraries in lib/openmpi.
>
> It appears that the location to search is getting set in
> mca_base_open.c around line 68 (1.1.2):
>
>     asprintf(&value, "%s:~/.openmpi/components", OPAL_PKGLIBDIR);
>     mca_base_param_component_path =
>         mca_base_param_reg_string_name("mca", "component_path",
>                                        "Path where to look for Open MPI and ORTE components",
>                                        false, false, value, NULL);
>
>
> Here, OPAL_PKGLIBDIR is a fixed, compile-time location. It appears
> that the --prefix directory (actually <prefix_dir>/lib/openmpi)
> needs to be appended, if not prepended, to the component_path.
> Alternatively, the static OPAL_PKGLIBDIR directory could just be
> replaced by the runtime value of <prefix_dir>/lib/openmpi.
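>
> Roughly, the kind of change I have in mind looks something like the
> following (just a sketch, not a patch; "prefix_dir" is a stand-in
> for however the runtime --prefix value would actually be made
> available at this point):
>
>     /* Sketch only: prepend the runtime prefix's component directory
>        (when one was given) to the compile-time default. */
>     char *value = NULL;
>     if (NULL != prefix_dir) {
>         asprintf(&value, "%s/lib/openmpi:%s:~/.openmpi/components",
>                  prefix_dir, OPAL_PKGLIBDIR);
>     } else {
>         asprintf(&value, "%s:~/.openmpi/components", OPAL_PKGLIBDIR);
>     }
>     mca_base_param_component_path =
>         mca_base_param_reg_string_name("mca", "component_path",
>                                        "Path where to look for Open MPI and ORTE components",
>                                        false, false, value, NULL);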
>
> I've compiled a quick fix into libopal.so to see if the approach
> addresses the issue. I didn't see how to get access to the
> --prefix directory at this point, so I just prepended
> getenv("LD_LIBRARY_PATH") to "value" and added
> <prefix_dir>/lib/openmpi to LD_LIBRARY_PATH before starting the app
> (note: this is just a way of verifying that if the --prefix
> directory were used here, it would address the issue; it is not a
> proposed solution, and <prefix_dir>/lib/openmpi should be used
> directly). Anyway, this fixed the issue and the application was
> able to start.
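>
> For reference, the hack amounted to roughly the following, inserted
> just before the mca_base_param_reg_string_name() call (again, only
> a debugging aid to prove the point, not a proposed change):
>
>     /* Debugging hack only: prepend the LD_LIBRARY_PATH entries to
>        the component search path so that the relocated components in
>        <prefix_dir>/lib/openmpi can be found. */
>     char *ld_path = getenv("LD_LIBRARY_PATH");
>     if (NULL != ld_path) {
>         char *patched = NULL;
>         asprintf(&patched, "%s:%s", ld_path, value);
>         free(value);
>         value = patched;
>     }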
>
> In applying this fix, I also found that it was important for
> mca_base_param_component_path to include the
> <prefix_dir>/lib/openmpi directory not only in the instances of
> mpirun and the MPI app, but also in all instances of orted before
> they dynamically load libraries.
> ----
>
> In summary, it seems that this issue can be resolved by applying
> the --prefix directory (<prefix_dir>/lib/openmpi) to
> mca_base_param_component_path in instances of mpirun, orted, and
> the MPI app.
>
> Any help in getting this fix implemented in the code base would be
> very much appreciated, and I'll be happy to provide any more
> information or help.
>
> Regards,
>
> Patrick
>
> P.S. Even with the fix, a (non-fatal) message is printed. It's
> probably a tangential issue, but I thought it was worth mentioning.
> Again, the --prefix directory probably needs to be used somewhere
> in place of a static directory. The message is:
>
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> rds:no-hostfile
> from the file:
> help-rds-hostfile.txt
> But I couldn't find any file matching that name. Sorry!
> --------------------------------------------------------------------------
> <pj.vcf>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems