Open MPI Development Mailing List Archives

From: Patrick Jessee (pj_at_[hidden])
Date: 2006-12-19 13:10:48


Jeff,

Thanks for the message. I responded to a separate message yesterday,
but will summarize that communication here to be complete.

>How high of a priority is the ability to re-home an OMPI installation
>for you?

It's high in the long term, but it is not urgent for 1.2.

We're very glad it's on the radar. Thanks again, and let me know if I
can provide any more information.

Regards,

-Patrick

Jeff Squyres wrote:

>Greetings Patrick. Many thanks for the detailed run-down; sorry I
>didn't reply earlier.
>
>This is quite definitely a known problem, and I'm pretty sure we have
>an open ticket on it (I'm on a plane right now and can't check the
>web-based bug tracker). We have a solution in mind for the issue,
>but it hadn't been done yet mainly because it hadn't bubbled up high
>enough in priority / no one had the time to code it up.
>
>How high of a priority is the ability to re-home an OMPI installation
>for you?
>
>
>On Dec 8, 2006, at 8:53 AM, Patrick Jessee wrote:
>
>
>
>>Hello. For Open MPI 1.1.2, I've come across a situation where the
>>--prefix syntax does not seem to be working. I've investigated the
>>issue by stepping through the mpirun startup in a debugger. Below
>>is a summary of the problem and details about the investigation
>>(along with a prospective fix).
>>
>>Summary of problem
>>===============
>>
>>When starting an Open MPI run with the --prefix option, the MPI
>>application does not start up correctly in certain situations. An
>>important point is that this behavior is masked (and not seen) if
>>the Open MPI libraries are available at the compile/install-time
>>location defined by OPAL_PKGLIBDIR (defined in
>>opal/include/opal/install_dirs.h). So in debugging the problem, it
>>is important to move the Open MPI installation away from the
>>installed location, and then set the --prefix value to the new
>>location. In addition, LD_LIBRARY_PATH needs to be set to the new
>>location so mpirun can find liborte.so and libopal.so at program
>>load time. (--prefix can't help mpirun with liborte.so and
>>libopal.so because (a) these libs are dynamically linked into
>>mpirun and are needed at program load time, and (b) the --prefix
>>arg isn't processed until after load time. Thus LD_LIBRARY_PATH is
>>needed for mpirun, but this is tangential.)
>>
>>The behavior that is seen is the following output:
>>
>>--------------------------------------------------------------------------
>>It looks like orte_init failed for some reason; your parallel process is
>>likely to abort. There are many reasons that a parallel process can
>>fail during orte_init; some of which are due to configuration or
>>environment problems. This failure appears to be an internal failure;
>>here's some additional information (which may only be relevant to an
>>Open MPI developer):
>>
>>orte_sds_base_select failed
>>--> Returned value -13 instead of ORTE_SUCCESS
>>:
>>:
>>--------------------------------------------------------------------------
>>Open RTE was unable to initialize properly. The error occurred while
>>attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
>>--------------------------------------------------------------------------
>>
>>
>>Investigation of the problem
>>===================
>>
>>As mentioned before, I've looked at mpirun in the debugger. The
>>instance of mpirun (and the MPI app) finds the dynamically linked
>>libraries (liborte.so, libopal.so) just fine, but does not locate
>>the dynamically loaded ones (the ones in lib/openmpi, such as
>>mca_paffinity_linux.so). The --prefix directory does not seem to
>>be used when opening the libraries in lib/openmpi.
>>
>>It appears that the location to search is getting set in
>>mca_base_open.c around line 68 (1.1.2):
>>
>>asprintf(&value, "%s:~/.openmpi/components", OPAL_PKGLIBDIR);
>>mca_base_param_component_path =
>>    mca_base_param_reg_string_name("mca", "component_path",
>>        "Path where to look for Open MPI and ORTE components",
>>        false, false, value, NULL);
>>
>>
>>Here, OPAL_PKGLIBDIR is a fixed, compile-time location. It appears
>>that the --prefix directory (actually <prefix_dir>/lib/openmpi)
>>needs to be appended, if not prepended, to the component_path.
>>Alternatively, the static OPAL_PKGLIBDIR directory could just be
>>replaced by the runtime value of <prefix_dir>/lib/openmpi.
>>
>>I've compiled a quick fix into libopal.so to see if the approach
>>addresses the issue. I didn't see how to get access to the
>>--prefix directory at this point, so I just prepended
>>getenv("LD_LIBRARY_PATH") to "value" and added
>><prefix_dir>/lib/openmpi to LD_LIBRARY_PATH before starting the
>>app. (Note: this is just a way of verifying that using the
>>--prefix directory here would address the issue; it is not a
>>proposed solution. The <prefix_dir>/lib/openmpi directory should
>>be used directly.) Anyway, this fixed the issue and the
>>application was able to start.
>>
>>In applying this fix, I also found that it was important for
>>mca_base_param_component_path to include the
>><prefix_dir>/lib/openmpi directory not only in the instances of
>>mpirun and the MPI app, but also in all instances of orted before
>>they dynamically load libraries.
>>----
>>
>>In summary, it seems that this issue can be resolved by applying
>>the --prefix directory (<prefix_dir>/lib/openmpi) to
>>mca_base_param_component_path in instances of mpirun, orted, and
>>the MPI app.
>>
>>Any help in getting this fix implemented in the code base would be
>>very much appreciated, and I'll be happy to provide any more
>>information or help.
>>
>>Regards,
>>
>>Patrick
>>
>>P.S. Even with the fix, a (non-fatal) message is printed. It's
>>probably a tangential issue, but thought it was worth mentioning.
>>Again, the --prefix directory probably needs to be used somewhere
>>in place of a static directory. The message is:
>>
>>--------------------------------------------------------------------------
>>Sorry! You were supposed to get help about:
>> rds:no-hostfile
>>from the file:
>> help-rds-hostfile.txt
>>But I couldn't find any file matching that name. Sorry!
>>--------------------------------------------------------------------------
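For illustration, the change proposed in the quoted message (putting
<prefix_dir>/lib/openmpi on the component search path) might look
roughly like the sketch below. This is only a sketch under stated
assumptions, not Open MPI code: build_component_path() is a hypothetical
helper, the compile-time directory is passed in as an argument rather
than read from OPAL_PKGLIBDIR, and prepending the runtime directory is
just one of the two orderings the message mentions. How the --prefix
value would reach this point in mpirun, orted, and the MPI app is not
shown.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of Open MPI): build a colon-separated
 * component search path. If runtime_prefix is non-empty, the directory
 * <runtime_prefix>/lib/openmpi is placed first, followed by the
 * compile-time directory and the per-user directory. Otherwise the
 * result matches the format string quoted from mca_base_open.c.
 * The caller owns the returned string and must free() it.
 * Returns NULL on error.
 */
char *build_component_path(const char *runtime_prefix,
                           const char *compiled_pkglibdir)
{
    const char *user_dir = "~/.openmpi/components";
    int use_prefix = (runtime_prefix != NULL && runtime_prefix[0] != '\0');
    char *path;
    int needed;

    if (compiled_pkglibdir == NULL) {
        return NULL;
    }

    /* First pass: ask snprintf how many characters are required. */
    if (use_prefix) {
        needed = snprintf(NULL, 0, "%s/lib/openmpi:%s:%s",
                          runtime_prefix, compiled_pkglibdir, user_dir);
    } else {
        needed = snprintf(NULL, 0, "%s:%s", compiled_pkglibdir, user_dir);
    }
    if (needed < 0) {
        return NULL;
    }

    path = malloc((size_t)needed + 1);
    if (path == NULL) {
        return NULL;
    }

    /* Second pass: write the path into the allocated buffer. */
    if (use_prefix) {
        snprintf(path, (size_t)needed + 1, "%s/lib/openmpi:%s:%s",
                 runtime_prefix, compiled_pkglibdir, user_dir);
    } else {
        snprintf(path, (size_t)needed + 1, "%s:%s",
                 compiled_pkglibdir, user_dir);
    }
    return path;
}
```

With a relocated installation, the result of such a helper would take
the place of "value" in the quoted asprintf() call, so the relocated
lib/openmpi directory is searched before the compile-time location.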