
Open MPI Development Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-07-19 10:35:21


It wasn't a bug. There is a bunch of code there just to make sure
PATH and LD_LIBRARY_PATH are set correctly.

Yesterday we discovered that even if you force the --prefix in a
similar execution environment, the LD_LIBRARY_PATH doesn't get set.
However, using localhost always solves the problem.

   george.

On Jul 19, 2007, at 10:18 AM, Gleb Natapov wrote:

> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>> Interesting. Apparently, it is getting a NULL back when it tries
>> to access
>> the LD_LIBRARY_PATH in your environment. Here is the code involved:
>>
>> newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>> oldenv = getenv("LD_LIBRARY_PATH");
>> if (NULL != oldenv) {
>>     char* temp;
>>     asprintf(&temp, "%s:%s", newenv, oldenv);
>>     free(newenv);
>>     newenv = temp;
>> }
>> opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>> if (mca_pls_rsh_component.debug) {
>>     opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
>> }
>> free(newenv);
>>
>> So you can see that the only way we can get your debugging output is
>> for the LD_LIBRARY_PATH in your starting environment to be NULL. Note
>> that this comes after we fork, so we are talking about the child
>> process - not sure that matters, but may as well point it out.
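
A minimal standalone sketch of that logic, with plain libc calls (getenv,
asprintf, setenv) standing in for the OPAL helpers opal_os_path() and
opal_setenv(), and with an example prefix lib directory, looks roughly
like this:

#define _GNU_SOURCE          /* for asprintf() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* example value only; in the real code this comes from the prefix */
    const char *prefix_lib = "/home/glebn/openmpi/lib";
    char *newenv = strdup(prefix_lib);
    const char *oldenv = getenv("LD_LIBRARY_PATH");

    if (NULL != oldenv) {
        /* an existing LD_LIBRARY_PATH is preserved after the new path */
        char *temp;
        if (asprintf(&temp, "%s:%s", newenv, oldenv) < 0)
            return 1;
        free(newenv);
        newenv = temp;
    }
    setenv("LD_LIBRARY_PATH", newenv, 1);
    /* with LD_LIBRARY_PATH unset in the starting environment, only the
     * prefix lib directory is printed - the case discussed above */
    printf("pls:rsh: reset LD_LIBRARY_PATH: %s\n", newenv);
    free(newenv);
    return 0;
}

Run with LD_LIBRARY_PATH unset, it prints only the prefix lib directory;
run with it set, the old value follows after the colon.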
>>
>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>> environment when you provide a different hostname?
> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
> mpirun to provide a working environment for all ranks, not just the
> remote ones. This is how it worked before. Perhaps that was a bug, but
> it was a useful bug :)
>
>>
>>
>> On 7/19/07 7:45 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>
>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>>> But this will lock up:
>>>>>
>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>>
>>>>> The reason is that the hostname in this last command doesn't match
>>>>> the hostname I get when I query my interfaces, so mpirun thinks it
>>>>> must be a remote host - and so we stick in ssh until that times out,
>>>>> which could be quick on your machine, but takes a while for me.
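
For illustration only (this is not the actual Open MPI code), one way such
a local-vs-remote check can be done is to resolve the supplied name and
compare the result against the addresses configured on the local
interfaces; the hypothetical hostname_is_local() below sketches that with
getaddrinfo() and getifaddrs(), limited to IPv4:

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* return 1 if 'name' resolves to an address owned by a local interface */
static int hostname_is_local(const char *name)
{
    struct addrinfo hints, *res = NULL, *r;
    struct ifaddrs *ifa_list = NULL, *ifa;
    int local = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;            /* keep the sketch to IPv4 */
    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return 0;                         /* unresolvable: treat as remote */
    if (getifaddrs(&ifa_list) != 0) {
        freeaddrinfo(res);
        return 0;
    }
    /* compare every resolved address against every interface address */
    for (r = res; r != NULL && !local; r = r->ai_next) {
        struct in_addr want = ((struct sockaddr_in *)r->ai_addr)->sin_addr;
        for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
                continue;
            struct in_addr have = ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr;
            if (want.s_addr == have.s_addr) {
                local = 1;
                break;
            }
        }
    }
    freeifaddrs(ifa_list);
    freeaddrinfo(res);
    return local;
}

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "localhost";
    printf("%s looks %s\n", name, hostname_is_local(name) ? "local" : "remote");
    return 0;
}

A name that resolves to an address not present on any local interface (or
that does not resolve at all) would then go down the ssh path described
above.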
>>>>>
>>>> This is not my case. mpirun resolves the hostname and runs env, but
>>>> LD_LIBRARY_PATH is not there. If I use the full name, like this:
>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>
>>>> everything is OK.
>>>>
>>> More info. If I give mpirun the hostname as returned by the "hostname"
>>> command, LD_LIBRARY_PATH is not set:
>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>
>>> If I provide any other name that resolves to the same IP, then
>>> LD_LIBRARY_PATH is set:
>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>
>>> Here is the debug output of a "bad" run:
>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
>>> [elfit1:14730] pls:rsh: launching job 1
>>> [elfit1:14730] pls:rsh: no new daemons to launch
>>>
>>> Here is a good one:
>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
>>> [elfit1:14752] pls:rsh: launching job 1
>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
>>> [elfit1:14752] pls:rsh: final template argv:
>>> [elfit1:14752] pls:rsh: /usr/bin/ssh <template> orted --name <template> --num_procs 1 --vpid_start 0 --nodename <template> --universe root_at_elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>> [elfit1:14752] pls:rsh: launching on node localhost
>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
>>> [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>> [elfit1:14752] pls:rsh: changing to directory /root
>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root_at_elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]
>>>
>>> --
>>> Gleb.
>
> --
> Gleb.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel