
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problem calling mpirun from script invoked with mpirun
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-10-28 21:06:05


I see. No, we don't copy your envars and ship them to remote nodes. Simple
reason is that we don't know which ones we can safely move, and which would
cause problems.

However, we do provide a mechanism for you to tell us which envars to move.
Just add:

-x LD_LIBRARY_PATH

to your mpirun cmd line and we will pick up that value and move it. You
can use that option any number of times.

I know it's a tad tedious if you have to move many of them, but it's the
only safe mechanism we could devise.
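For instance, a job script might forward several variables by repeating -x. The sketch below is illustrative only: the variable name MYLIBDIR and its path are hypothetical, the mpirun line is commented out since it assumes a working Open MPI install, and the env -i lines merely simulate the clean environment a remotely launched process starts with.

```shell
#!/bin/sh
# Illustrative: forward selected envars to remote nodes by repeating -x
# (commented out because it assumes a working Open MPI installation):
#   mpirun -np 4 -x LD_LIBRARY_PATH -x OMP_NUM_THREADS ./mpiprogram

# Why -x is needed: a remotely launched process does not inherit your
# local environment. 'env -i' simulates that clean starting environment.
export MYLIBDIR=/opt/intel/lib            # hypothetical library path
env -i /bin/sh -c 'echo "without forwarding: ${MYLIBDIR:-unset}"'
env -i MYLIBDIR="$MYLIBDIR" /bin/sh -c 'echo "with forwarding: $MYLIBDIR"'
```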

HTH
Ralph

On Wed, Oct 28, 2009 at 2:36 PM, Luke Shulenburger
<lshulenburger_at_[hidden]> wrote:

> My apologies for not being clear. These variables are set in my
> environment, they just are not published to the other nodes in the
> cluster when the jobs are run through the scheduler. At the moment,
> even though I can use mpirun to run jobs locally on the head node
> without touching my environment, if I use the scheduler I am forced to
> do something like source my bashrc in the job submission script to get
> them set. I had always assumed that mpirun just copied my current
> environment variables to the nodes, but this does not seem to be
> happening in this case.
>
> Luke
>
> On Wed, Oct 28, 2009 at 4:30 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > Normally, one simply sets LD_LIBRARY_PATH in your environment to
> > point to the right thing. Alternatively, you could configure OMPI with
> >
> > --enable-mpirun-prefix-by-default
> >
> > This tells OMPI to automatically add the prefix you configured the system
> > with to your LD_LIBRARY_PATH and PATH envars. It should solve your
> > problem, if you don't want to simply set those values in your environment
> > anyway.
> >
> > Ralph
> >
> >
> > On Wed, Oct 28, 2009 at 2:10 PM, Luke Shulenburger
> > <lshulenburger_at_[hidden]> wrote:
> >>
> >> Thanks for the quick reply. This leads me to another issue I have
> >> been having with openmpi as it relates to sge. The "tight
> >> integration" works in that I do not have to give mpirun a hostfile when
> >> I use the scheduler, but it does not seem to be passing on my
> >> environment variables. Specifically because I used intel compilers to
> >> compile openmpi, I have to be sure to set the LD_LIBRARY_PATH
> >> correctly in my job submission script or openmpi will not run (giving
> >> the error discussed in the FAQ). Where I am a little lost is whether
> >> this is a problem with the way I built openmpi or whether it is a
> >> configuration problem with sge.
> >>
> >> This may be unrelated to my previous problem, but the similarities
> >> with the environment variables made me think of it.
> >>
> >> Thanks for your consideration,
> >> Luke Shulenburger
> >> Geophysical Laboratory
> >> Carnegie Institution of Washington
> >>
> >> On Wed, Oct 28, 2009 at 3:48 PM, Ralph Castain <rhc_at_[hidden]>
> >> wrote:
> >> > I'm afraid we have never really supported this kind of nested
> >> > invocation of mpirun. If it works with any version of OMPI, it is
> >> > totally a fluke - it might work one time, and then fail the next.
> >> >
> >> > The problem is that we pass envars to the launched processes to
> >> > control their behavior, and these conflict with what mpirun needs. We
> >> > have tried various scrubbing mechanisms (i.e., having mpirun start out
> >> > by scrubbing the environment of envars that would have come from the
> >> > initial mpirun), but they all have the unfortunate possibility of
> >> > removing parameters provided by the user - and that can cause its own
> >> > problems.
> >> >
> >> > I don't know if we will ever support nested operations -
> >> > occasionally, I do give it some thought, but have yet to find a
> >> > foolproof solution.
> >> >
> >> > Ralph
> >> >
> >> >
> >> > On Wed, Oct 28, 2009 at 1:11 PM, Luke Shulenburger
> >> > <lshulenburger_at_[hidden]>
> >> > wrote:
> >> >>
> >> >> Hello,
> >> >> I am having trouble with a script that calls mpi. Basically my
> >> >> problem distills to wanting to call a script with:
> >> >>
> >> >> mpirun -np # ./script.sh
> >> >>
> >> >> where script.sh looks like:
> >> >> #!/bin/bash
> >> >> mpirun -np 2 ./mpiprogram
> >> >>
> >> >> Whenever I invoke script.sh normally (as ./script.sh for instance) it
> >> >> works fine, but if I do mpirun -np 2 ./script.sh I get the following
> >> >> error:
> >> >>
> >> >> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
> >> >> attempting to be sent to a process whose contact information is
> >> >> unknown in file rml_oob_send.c at line 105
> >> >> [ppv.stanford.edu:08814] [[27860,1],0] could not get route to
> >> >> [[INVALID],INVALID]
> >> >> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
> >> >> attempting to be sent to a process whose contact information is
> >> >> unknown in file base/plm_base_proxy.c at line 86
> >> >>
> >> >> I have also tried running with mpirun -d to get some debugging info
> >> >> and it appears that the proctable is not being created for the second
> >> >> mpirun. The command hangs like so:
> >> >>
> >> >> [ppv.stanford.edu:08823] procdir:
> >> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/0/0
> >> >> [ppv.stanford.edu:08823] jobdir:
> >> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/0
> >> >> [ppv.stanford.edu:08823] top: openmpi-sessions-sluke_at_[hidden]_0
> >> >> [ppv.stanford.edu:08823] tmp: /tmp
> >> >> [ppv.stanford.edu:08823] [[27855,0],0] node[0].name ppv daemon 0 arch
> >> >> ffc91200
> >> >> [ppv.stanford.edu:08823] Info: Setting up debugger process table for
> >> >> applications
> >> >> MPIR_being_debugged = 0
> >> >> MPIR_debug_state = 1
> >> >> MPIR_partial_attach_ok = 1
> >> >> MPIR_i_am_starter = 0
> >> >> MPIR_proctable_size = 1
> >> >> MPIR_proctable:
> >> >> (i, host, exe, pid) = (0, ppv.stanford.edu,
> >> >> /home/sluke/maintenance/openmpi-1.3.3/examples/./shell.sh, 8824)
> >> >> [ppv.stanford.edu:08825] procdir:
> >> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/1/0
> >> >> [ppv.stanford.edu:08825] jobdir:
> >> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/1
> >> >> [ppv.stanford.edu:08825] top: openmpi-sessions-sluke_at_[hidden]_0
> >> >> [ppv.stanford.edu:08825] tmp: /tmp
> >> >> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
> >> >> attempting to be sent to a process whose contact information is
> >> >> unknown in file rml_oob_send.c at line 105
> >> >> [ppv.stanford.edu:08825] [[27855,1],0] could not get route to
> >> >> [[INVALID],INVALID]
> >> >> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
> >> >> attempting to be sent to a process whose contact information is
> >> >> unknown in file base/plm_base_proxy.c at line 86
> >> >> [ppv.stanford.edu:08825] Info: Setting up debugger process table for
> >> >> applications
> >> >> MPIR_being_debugged = 0
> >> >> MPIR_debug_state = 1
> >> >> MPIR_partial_attach_ok = 1
> >> >> MPIR_i_am_starter = 0
> >> >> MPIR_proctable_size = 0
> >> >> MPIR_proctable:
> >> >>
> >> >>
> >> >> In this case, it does not matter what the ultimate mpiprogram I try
> >> >> to run is; the shell script fails in the same way regardless (I've
> >> >> tried the hello_f90 executable from the openmpi examples directory).
> >> >> Here are some details of my setup:
> >> >>
> >> >> I have built openmpi 1.3.3 with the Intel Fortran and C compilers
> >> >> (version 11.1). The machine uses Rocks with the SGE scheduler, so I
> >> >> ran ./configure --prefix=/home/sluke --with-sge; however, this
> >> >> problem persists even if I am running on the head node outside of
> >> >> the scheduler. I am attaching the resulting config.log to this email
> >> >> as well as the output of ompi_info --all and ifconfig. I hope this
> >> >> gives the experts on the list enough to go on, but I will be happy
> >> >> to provide any more information that might be helpful.
> >> >>
> >> >> Luke Shulenburger
> >> >> Geophysical Laboratory
> >> >> Carnegie Institution of Washington
> >> >>
> >> >>
> >> >> PS I have tried this on a machine with openmpi-1.2.6 and cannot
> >> >> reproduce the error; however, on a second machine with openmpi-1.3.2
> >> >> I have the same problem.
> >> >>
> >> >> _______________________________________________
> >> >> users mailing list
> >> >> users_at_[hidden]
> >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >
> >>
> >
>