
Open MPI Development Mailing List Archives


From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-07-22 04:54:05


On Thu, Jul 19, 2007 at 01:04:27PM -0600, Ralph H Castain wrote:
> I fixed the specific problem of setting the LD_LIBRARY_PATH (and PATH,
> though that wasn't mentioned) for the case of procs spawned locally by
> mpirun - see r15516. Please confirm that the problem is gone and/or let me
> know if it persists for you.
My test cases now work. Thanks.

>
> The issue of name resolution is a more general problem that will take some
> discussion - to occur separately from this chain. So some of the behavior
> you cited continues for the moment.
>
> Thanks
> Ralph
>
>
>
> On 7/19/07 9:39 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
>
> > Talked with Brian and we have identified the problem and a fix - will come
> > in later today.
> >
> > Thanks
> > Ralph
> >
> >
> >
> > On 7/19/07 9:24 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
> >
> >> You are correct - I misread the note. My bad.
> >>
> >> I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
> >> shouldn't be a big deal.
> >>
> >>
> >> On 7/19/07 9:12 AM, "George Bosilca" <bosilca_at_[hidden]> wrote:
> >>
> >>> The second execution (the one that you make reference to) is the one
> >>> that works fine. The failing one is the first one, where
> >>> LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
> >>> makes the problem vanish.
> >>>
> >>> george.
> >>>
> >>> On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
> >>>
> >>>> But it *does* provide an LD_LIBRARY_PATH that is pointing to your
> >>>> openmpi
> >>>> installation - it says it did it right here in your debug output:
> >>>>
> >>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
> >>>>
> >>>> I suspect that the problem isn't in the launcher, but rather in the iof
> >>>> again. Why don't we wait until those fixes come into the trunk before
> >>>> chasing our tails any further?
> >>>>
> >>>>
> >>>> On 7/19/07 8:18 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
> >>>>
> >>>>> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
> >>>>>> Interesting. Apparently, it is getting a NULL back when it tries to
> >>>>>> access the LD_LIBRARY_PATH in your environment. Here is the code involved:
> >>>>>>
> >>>>>> newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
> >>>>>> oldenv = getenv("LD_LIBRARY_PATH");
> >>>>>> if (NULL != oldenv) {
> >>>>>>     char* temp;
> >>>>>>     asprintf(&temp, "%s:%s", newenv, oldenv);
> >>>>>>     free(newenv);
> >>>>>>     newenv = temp;
> >>>>>> }
> >>>>>> opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
> >>>>>> if (mca_pls_rsh_component.debug) {
> >>>>>>     opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
> >>>>>> }
> >>>>>> free(newenv);
> >>>>>>
> >>>>>> So you can see that the only way we can get your debugging output is
> >>>>>> for the LD_LIBRARY_PATH in your starting environment to be NULL. Note
> >>>>>> that this comes after we fork, so we are talking about the child
> >>>>>> process - not sure that matters, but may as well point it out.
> >>>>>>
> >>>>>> So the question is: why do you not have LD_LIBRARY_PATH set in your
> >>>>>> environment when you provide a different hostname?
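
For illustration only: a minimal standalone sketch of the prepend logic quoted
above, written against plain libc (getenv/asprintf/setenv) rather than the
opal_* wrappers, and using a placeholder prefix path. It is not the actual
pls:rsh code; it just shows why the debug line prints only the prefix when
LD_LIBRARY_PATH is unset - getenv() returns NULL and the prepend branch is
skipped.

    /* Standalone sketch of the quoted prepend logic -- not the actual
     * pls:rsh code.  The prefix path below is a placeholder. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *prefix_lib = "/opt/openmpi/lib";  /* stand-in for prefix_dir + lib_base */
        char *oldenv = getenv("LD_LIBRARY_PATH");     /* NULL if the variable is unset */
        char *newenv = NULL;

        if (NULL != oldenv) {
            /* variable already set: prepend the prefix to the old value */
            if (asprintf(&newenv, "%s:%s", prefix_lib, oldenv) < 0) return 1;
        } else {
            /* variable unset: the new value is just the prefix */
            if (asprintf(&newenv, "%s", prefix_lib) < 0) return 1;
        }

        printf("reset LD_LIBRARY_PATH: %s\n", newenv);
        setenv("LD_LIBRARY_PATH", newenv, 1);         /* overwrite in this process */
        free(newenv);
        return 0;
    }

Running it with and without LD_LIBRARY_PATH exported reproduces both forms of
the debug output.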
> >>>>> Right, I don't have LD_LIBRARY_PATH set in my environment, but I
> >>>>> expect that mpirun will provide a working environment for all ranks,
> >>>>> not just the remote ones. This is how it worked before. Perhaps that
> >>>>> was a bug, but it was a useful bug :)
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 7/19/07 7:45 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
> >>>>>>
> >>>>>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
> >>>>>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
> >>>>>>>>> But this will lock up:
> >>>>>>>>>
> >>>>>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
> >>>>>>>>>
> >>>>>>>>> The reason is that the hostname in this last command doesn't match
> >>>>>>>>> the hostname I get when I query my interfaces, so mpirun thinks it
> >>>>>>>>> must be a remote host - and so we get stuck in ssh until that times
> >>>>>>>>> out. Which could be quick on your machine, but takes a while for me.
> >>>>>>>>>
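
As an aside, the classification problem described above can be illustrated
with a simplified sketch: a launcher that decides local vs. remote by
comparing name strings against the local node name (rather than by resolving
addresses or enumerating interfaces) will treat elfit1, elfit1.voltaire.com
and localhost differently even though they all refer to the same machine.
This is only an illustration of the failure mode, not the actual Open MPI
matching code.

    /* Simplified illustration of string-based local/remote classification.
     * Not the Open MPI implementation: it only compares names, it does not
     * resolve addresses or query interfaces. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char local[256] = "";
        const char *candidate = (argc > 1) ? argv[1] : "localhost";

        if (gethostname(local, sizeof(local) - 1) != 0) {
            perror("gethostname");
            return 1;
        }

        if (0 == strcmp(candidate, local) || 0 == strcmp(candidate, "localhost")) {
            printf("%s: treated as LOCAL (matches \"%s\" or \"localhost\")\n",
                   candidate, local);
        } else {
            printf("%s: treated as REMOTE (no string match with \"%s\") -> ssh path\n",
                   candidate, local);
        }
        return 0;
    }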
> >>>>>>>> This is not my case. mpirun resolves the hostname and runs env, but
> >>>>>>>> LD_LIBRARY_PATH is not there. If I use the full name like this:
> >>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
> >>>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
> >>>>>>>>
> >>>>>>>> everything is OK.
> >>>>>>>>
> >>>>>>> More info: if I provide the hostname to mpirun as returned by the
> >>>>>>> "hostname" command, LD_LIBRARY_PATH is not set:
> >>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
> >>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
> >>>>>>>
> >>>>>>> If I provide any other name that resolves to the same IP, then
> >>>>>>> LD_LIBRARY_PATH is set:
> >>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
> >>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
> >>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
> >>>>>>>
> >>>>>>> Here is the debug output of the "bad" run:
> >>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
> >>>>>>> [elfit1:14730] pls:rsh: launching job 1
> >>>>>>> [elfit1:14730] pls:rsh: no new daemons to launch
> >>>>>>>
> >>>>>>> Here is the good one:
> >>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
> >>>>>>> [elfit1:14752] pls:rsh: launching job 1
> >>>>>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
> >>>>>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
> >>>>>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
> >>>>>>> [elfit1:14752] pls:rsh: final template argv:
> >>>>>>> [elfit1:14752] pls:rsh: /usr/bin/ssh <template> orted --name <template> --num_procs 1 --vpid_start 0 --nodename <template> --universe root_at_elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
> >>>>>>> [elfit1:14752] pls:rsh: launching on node localhost
> >>>>>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
> >>>>>>> [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
> >>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
> >>>>>>> [elfit1:14752] pls:rsh: changing to directory /root
> >>>>>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root_at_elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]
> >>>>>>>
> >>>>>>> --
> >>>>>>> Gleb.

--
			Gleb.