Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problem calling mpirun from script invoked with mpirun
From: Luke Shulenburger (lshulenburger_at_[hidden])
Date: 2009-10-28 16:36:46


My apologies for not being clear. These variables are set in my
environment; they just are not propagated to the other nodes in the
cluster when jobs are run through the scheduler. At the moment,
even though I can use mpirun to run jobs locally on the head node
without touching my environment, if I use the scheduler I am forced to
do something like source my bashrc in the job submission script to get
them set. I had always assumed that mpirun simply copied my current
environment variables to the nodes, but that does not seem to be
happening in this case.
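[For reference: mpirun does not forward the full local environment by
default, but it can export selected variables with its -x flag, and SGE
can export the submitter's environment with qsub's -V option. A minimal
sketch of a submission script; the parallel environment name and job
size are placeholders that depend on the site's SGE configuration:]

```shell
#!/bin/bash
# hypothetical SGE job script (submitted with: qsub job.sh)
#$ -V              # ask SGE to export the full submission-time environment
#$ -pe orte 8      # site-specific parallel environment for Open MPI tight integration
#$ -cwd

# Alternatively, forward only the variables the remote processes need.
# -x takes the value from the environment in which mpirun itself runs,
# so LD_LIBRARY_PATH must already be set on the submission host.
mpirun -x LD_LIBRARY_PATH -x PATH ./mpiprogram
```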

Luke

On Wed, Oct 28, 2009 at 4:30 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> Normally, one simply sets LD_LIBRARY_PATH in the environment to
> point to the right thing. Alternatively, you could configure OMPI with
>
> --enable-mpirun-prefix-by-default
>
> This tells OMPI to automatically add the prefix you configured the system
> with to your LD_LIBRARY_PATH and PATH envars. It should solve your problem
> if you don't want to simply set those values in your environment anyway.
>
> Ralph
>
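[For later readers: a rebuild with the option Ralph mentions would look
roughly like the following. The install prefix and --with-sge flag are
taken from the configure line quoted further down in this thread; treat
the exact paths as an example:]

```shell
# Reconfigure Open MPI so that mpirun automatically prepends the install
# prefix to PATH and LD_LIBRARY_PATH on the remote nodes.
./configure --prefix=/home/sluke --with-sge --enable-mpirun-prefix-by-default
make -j4 && make install
```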
>
> On Wed, Oct 28, 2009 at 2:10 PM, Luke Shulenburger <lshulenburger_at_[hidden]>
> wrote:
>>
>> Thanks for the quick reply.  This leads me to another issue I have
>> been having with Open MPI as it relates to SGE.  The "tight
>> integration" works in that I do not have to give mpirun a hostfile
>> when I use the scheduler, but it does not seem to be passing on my
>> environment variables.  Specifically, because I used the Intel
>> compilers to build Open MPI, I have to be sure to set LD_LIBRARY_PATH
>> correctly in my job submission script or Open MPI will not run (giving
>> the error discussed in the FAQ).  Where I am a little lost is whether
>> this is a problem with the way I built Open MPI or whether it is a
>> configuration problem with SGE.
>>
>> This may be unrelated to my previous problem, but the similarities
>> with the environment variables made me think of it.
>>
>> Thanks for your consideration,
>> Luke Shulenburger
>> Geophysical Laboratory
>> Carnegie Institution of Washington
>>
>> On Wed, Oct 28, 2009 at 3:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> > I'm afraid we have never really supported this kind of nested
>> > invocations of
>> > mpirun. If it works with any version of OMPI, it is totally a fluke - it
>> > might work one time, and then fail the next.
>> >
>> > The problem is that we pass envars to the launched processes to control
>> > their behavior, and these conflict with what mpirun needs. We have tried
>> > various scrubbing mechanisms (i.e., having mpirun start out by scrubbing
>> > from the environment any envars that would have come from the initial
>> > mpirun), but they all have the unfortunate possibility of removing
>> > parameters provided by the user - and that can cause its own problems.
>> >
>> > I don't know if we will ever support nested operations - occasionally, I
>> > do
>> > give it some thought, but have yet to find a foolproof solution.
>> >
>> > Ralph
>> >
>> >
>> > On Wed, Oct 28, 2009 at 1:11 PM, Luke Shulenburger
>> > <lshulenburger_at_[hidden]>
>> > wrote:
>> >>
>> >> Hello,
>> >> I am having trouble with a script that calls mpi.  Basically my
>> >> problem distills to wanting to call a script with:
>> >>
>> >> mpirun -np # ./script.sh
>> >>
>> >> where script.sh looks like:
>> >> #!/bin/bash
>> >> mpirun -np 2 ./mpiprogram
>> >>
>> >> Whenever I invoke script.sh normally (as ./script.sh for instance) it
>> >> works fine, but if I do mpirun -np 2 ./script.sh I get the following
>> >> error:
>> >>
>> >> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
>> >> attempting to be sent to a process whose contact information is
>> >> unknown in file rml_oob_send.c at line 105
>> >> [ppv.stanford.edu:08814] [[27860,1],0] could not get route to
>> >> [[INVALID],INVALID]
>> >> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
>> >> attempting to be sent to a process whose contact information is
>> >> unknown in file base/plm_base_proxy.c at line 86
>> >>
>> >> I have also tried running with mpirun -d to get some debugging info
>> >> and it appears that the proctable is not being created for the second
>> >> mpirun.  The command hangs like so:
>> >>
>> >> [ppv.stanford.edu:08823] procdir:
>> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/0/0
>> >> [ppv.stanford.edu:08823] jobdir:
>> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/0
>> >> [ppv.stanford.edu:08823] top: openmpi-sessions-sluke_at_[hidden]_0
>> >> [ppv.stanford.edu:08823] tmp: /tmp
>> >> [ppv.stanford.edu:08823] [[27855,0],0] node[0].name ppv daemon 0 arch
>> >> ffc91200
>> >> [ppv.stanford.edu:08823] Info: Setting up debugger process table for
>> >> applications
>> >>  MPIR_being_debugged = 0
>> >>  MPIR_debug_state = 1
>> >>  MPIR_partial_attach_ok = 1
>> >>  MPIR_i_am_starter = 0
>> >>  MPIR_proctable_size = 1
>> >>  MPIR_proctable:
>> >>    (i, host, exe, pid) = (0, ppv.stanford.edu,
>> >> /home/sluke/maintenance/openmpi-1.3.3/examples/./shell.sh, 8824)
>> >> [ppv.stanford.edu:08825] procdir:
>> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/1/0
>> >> [ppv.stanford.edu:08825] jobdir:
>> >> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/1
>> >> [ppv.stanford.edu:08825] top: openmpi-sessions-sluke_at_[hidden]_0
>> >> [ppv.stanford.edu:08825] tmp: /tmp
>> >> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
>> >> attempting to be sent to a process whose contact information is
>> >> unknown in file rml_oob_send.c at line 105
>> >> [ppv.stanford.edu:08825] [[27855,1],0] could not get route to
>> >> [[INVALID],INVALID]
>> >> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
>> >> attempting to be sent to a process whose contact information is
>> >> unknown in file base/plm_base_proxy.c at line 86
>> >> [ppv.stanford.edu:08825] Info: Setting up debugger process table for
>> >> applications
>> >>  MPIR_being_debugged = 0
>> >>  MPIR_debug_state = 1
>> >>  MPIR_partial_attach_ok = 1
>> >>  MPIR_i_am_starter = 0
>> >>  MPIR_proctable_size = 0
>> >>  MPIR_proctable:
>> >>
>> >>
>> >> In this case, it does not matter what the ultimate mpiprogram I try to
>> >> run is, the shell script fails in the same way regardless (I've tried
>> >> the hello_f90 executable from the openmpi examples directory).  Here
>> >> are some details of my setup:
>> >>
>> >> I have built Open MPI 1.3.3 with the Intel Fortran and C compilers
>> >> (version 11.1).  The machine uses Rocks with the SGE scheduler, so I
>> >> ran configure with ./configure --prefix=/home/sluke --with-sge;
>> >> however, this problem persists even if I am running on the head node
>> >> outside of the scheduler.  I am attaching the resulting config.log to
>> >> this email as well as the output of ompi_info --all and ifconfig.  I
>> >> hope this gives the experts on the list enough to go on, but I will
>> >> be happy to provide any more information that might be helpful.
>> >>
>> >> Luke Shulenburger
>> >> Geophysical Laboratory
>> >> Carnegie Institution of Washington
>> >>
>> >>
>> >> PS I have tried this on a machine with openmpi-1.2.6 and cannot
>> >> reproduce the error, however on a second machine with openmpi-1.3.2 I
>> >> have the same problem.
>> >>
>> >> _______________________________________________
>> >> users mailing list
>> >> users_at_[hidden]
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users