Open MPI Development Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-07-19 11:39:24


Talked with Brian and we have identified the problem and a fix - will come
in later today.

Thanks
Ralph

On 7/19/07 9:24 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:

> You are correct - I misread the note. My bad.
>
> I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
> shouldn't be a big deal.
>
>
> On 7/19/07 9:12 AM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>
>> The second execution (the one that you make reference to) is the one
>> that works fine. The failing one is the first one, where
>> LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
>> makes the problem vanish.
>>
>> george.
>>
>> On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
>>
>>> But it *does* provide an LD_LIBRARY_PATH that is pointing to your openmpi
>>> installation - it says it did it right here in your debug output:
>>>
>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>
>>> I suspect that the problem isn't in the launcher, but rather in the iof
>>> again. Why don't we wait until those fixes come into the trunk before
>>> chasing our tails any further?
>>>
>>>
>>> On 7/19/07 8:18 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>
>>>> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>>>>> Interesting. Apparently, it is getting a NULL back when it tries to
>>>>> access the LD_LIBRARY_PATH in your environment. Here is the code involved:
>>>>>
>>>>>     newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>>>>>     oldenv = getenv("LD_LIBRARY_PATH");
>>>>>     if (NULL != oldenv) {
>>>>>         char* temp;
>>>>>         asprintf(&temp, "%s:%s", newenv, oldenv);
>>>>>         free(newenv);
>>>>>         newenv = temp;
>>>>>     }
>>>>>     opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>>>>>     if (mca_pls_rsh_component.debug) {
>>>>>         opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
>>>>>     }
>>>>>     free(newenv);
>>>>>
>>>>> So you can see that the only way we can get your debugging output is
>>>>> for the LD_LIBRARY_PATH in your starting environment to be NULL. Note
>>>>> that this comes after we fork, so we are talking about the child
>>>>> process - not sure that matters, but may as well point it out.
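
The prepend-or-replace logic in the snippet quoted above can be sketched in portable C. This is only an illustration under stated assumptions, not the Open MPI implementation: build_ld_library_path is a hypothetical helper name, it uses malloc/snprintf instead of asprintf and the opal_* calls, and, unlike the quoted code, it treats an empty existing value the same as an unset one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): returns a newly allocated
 * string holding libdir, followed by ":" and oldenv when oldenv is set
 * and non-empty. Returns NULL on allocation failure; the caller frees
 * the result. */
static char *build_ld_library_path(const char *libdir, const char *oldenv)
{
    size_t len;
    char *result;

    if (NULL == oldenv || '\0' == oldenv[0]) {
        /* Nothing to preserve: the new value is just the lib directory. */
        len = strlen(libdir) + 1;
        result = malloc(len);
        if (NULL != result) {
            memcpy(result, libdir, len);
        }
        return result;
    }

    /* Prepend libdir so its libraries are searched first. */
    len = strlen(libdir) + 1 + strlen(oldenv) + 1;
    result = malloc(len);
    if (NULL != result) {
        snprintf(result, len, "%s:%s", libdir, oldenv);
    }
    return result;
}
```

A caller would pass getenv("LD_LIBRARY_PATH") as the second argument. When that is NULL, the result is the bare lib directory, which matches the "reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib" line in the debug output.
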
>>>>>
>>>>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>>>>> environment when you provide a different hostname?
>>>> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
>>>> mpirun to provide a working environment for all ranks, not just remote
>>>> ones. This is how it worked before. Perhaps that was a bug, but it was
>>>> a useful bug :)
>>>>
>>>>>
>>>>>
>>>>> On 7/19/07 7:45 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>>
>>>>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>>>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>>>>>> But this will lockup:
>>>>>>>>
>>>>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>>>>>
>>>>>>>> The reason is that the hostname in this last command doesn't match
>>>>>>>> the hostname I get when I query my interfaces, so mpirun thinks it
>>>>>>>> must be a remote host - and so we stick in ssh until that times out.
>>>>>>>> Which could be quick on your machine, but takes a while for me.
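
The local-versus-remote decision described above can be illustrated with a small name-matching check. This is a sketch under assumptions, not how mpirun actually decides: is_local_name is a hypothetical helper, and the list of local names is supplied by the caller rather than queried from the network interfaces.

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper (not Open MPI code): returns 1 if `name` matches one
 * of the `count` entries in `local_names`, ignoring case, and 0 otherwise. */
static int is_local_name(const char *name,
                         const char *const *local_names, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i) {
        if (0 == strcasecmp(name, local_names[i])) {
            return 1;
        }
    }
    /* No match: a launcher using this check would treat the host as remote. */
    return 0;
}
```

With a purely textual check like this, a short name such as "node01" does not match a local entry "node01.example.org" (both names are made up for illustration), so the host would be treated as remote and launched through ssh.
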
>>>>>>>>
>>>>>>> This is not my case. mpirun resolves the hostname and runs env, but
>>>>>>> LD_LIBRARY_PATH is not there. If I use the full name like this
>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>>>
>>>>>>> everything is OK.
>>>>>>>
>>>>>> More info. If I give mpirun the hostname as returned by the
>>>>>> "hostname" command, LD_LIBRARY_PATH is not set:
>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>>>
>>>>>> If I provide any other name that resolves to the same IP, then
>>>>>> LD_LIBRARY_PATH is set:
>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>>
>>>>>> Here is the debug output of the "bad" run:
>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
>>>>>> [elfit1:14730] pls:rsh: launching job 1
>>>>>> [elfit1:14730] pls:rsh: no new daemons to launch
>>>>>>
>>>>>> Here is the good one:
>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
>>>>>> [elfit1:14752] pls:rsh: launching job 1
>>>>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
>>>>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
>>>>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
>>>>>> [elfit1:14752] pls:rsh: final template argv:
>>>>>> [elfit1:14752] pls:rsh: /usr/bin/ssh <template> orted --name <template> --num_procs 1 --vpid_start 0 --nodename <template> --universe root_at_elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>>>>> [elfit1:14752] pls:rsh: launching on node localhost
>>>>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
>>>>>> [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>>>> [elfit1:14752] pls:rsh: changing to directory /root
>>>>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root_at_elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]
>>>>>>
>>>>>> --
>>>>>> Gleb.
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>> --
>>>> Gleb.