
Open MPI User's Mailing List Archives


From: Lev Gelb (gelb_at_[hidden])
Date: 2007-07-11 20:32:13


Well done, that was exactly the problem -

Python's os.environ holds the complete collection of shell variables, and
os.system passes them all along to the child.

I tried a different os method, os.execve, which lets me specify the
environment explicitly (I took out all the OMPI_* variables), and the
second orterun call worked!

Now I just need a cleaner way to reset the environment within the
spawned process. (Or, a way to tell orterun to ignore/overwrite the
existing OMPI_* variables...?)
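
For the record, a rough sketch of what I'm doing now (the orterun path
and the fork/exec wrapper are just illustrative; output redirection is
omitted for brevity):

    import os

    # Copy the environment, dropping Open MPI's bookkeeping variables
    # so the nested orterun starts from a clean slate.
    clean_env = dict((k, v) for k, v in os.environ.items()
                     if not k.startswith('OMPI_'))

    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the nested orterun.
        os.execve('/usr/bin/orterun',
                  ['orterun', '-np', '2', 'nwchem.x', 'nwchem.inp'],
                  clean_env)
    else:
        os.waitpid(pid, 0)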

Thanks for your help,

Lev

On Wed, 11 Jul 2007, Ralph Castain wrote:

> Hmmm...interesting. As a cross-check on something - if you os.system, does
> your environment by any chance get copied across? Reason I ask: we set a
> number of environment variables when orterun spawns a process. If you call
> orterun from within that process - and the new orterun sees the environment
> variables from the parent process - then I can guarantee it won't work.
>
> What you need is for os.system to start its child with a clean environment.
> I would imagine that if you just os.system'd something that printed the
> environment, that would suffice to identify the problem. If you see anything
> that starts with OMPI_MCA_..., then we are indeed doomed.
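>
> For example, a one-liner along these lines would show whether those
> variables leak through:
>
>     os.system("env | grep OMPI_")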
>
> Which would definitely explain why the persistent orted wouldn't help solve
> the problem.
>
> Ralph
>
>
>
> On 7/11/07 3:05 PM, "Lev Gelb" <gelb_at_[hidden]> wrote:
>
>>
>> Thanks for the suggestions. The separate 'orted' scheme (below) did
>> not work, unfortunately; same behavior as before. I have conducted
>> a few other simple tests, and found:
>>
>> 1. The problem only occurs if the first process is "in" MPI;
>> if it doesn't call MPI_Init or calls MPI_Finalize before it executes
>> the second orterun, everything works.
>>
>> 2. Whether or not the second process actually uses MPI doesn't matter.
>>
>> 3. Using the standalone orted in "debug" mode with "universe" specified
>> throughout, there does not appear to be any communication to orted upon
>> the second invocation of orterun.
>>
>> (Also, I've been able to get working nested orteruns using simple shell
>> scripts, but these don't call MPI_Init.)
>>
>> Cheers,
>>
>> Lev
>>
>>
>>
>> On Wed, 11 Jul 2007, Ralph H Castain wrote:
>>
>>> Hmmm...well, what that indicates is that your application program is losing
>>> the connection to orterun, but that orterun is still alive and kicking (it
>>> is alive enough to send the [0,0,1] daemon a message ordering it to exit).
>>> So the question is: why is your application program dropping the connection?
>>>
>>> I haven't tried doing embedded orterun commands, so there could be a
>>> conflict there that causes the OOB connection to drop. Best guess is that
>>> there is confusion over which orterun it is supposed to connect to. I can
>>> give it a try and see - this may not be a mode we can support.
>>>
>>> Alternatively, you could start a persistent daemon and then just allow both
>>> orterun instances to report to it. Our method for doing that isn't as
>>> convenient as we would like (we hope to improve it soon), but it does work.
>>> What you have to do is:
>>>
>>> 1. to start the persistent daemon, type:
>>>
>>> "orted --seed --persistent --scope public --universe foo"
>>>
>>> where foo can be whatever name you like.
>>>
>>> 2. when you execute your application, use:
>>>
>>> orterun -np 1 --universe foo python ./test.py
>>>
>>> where the "foo" matches the name given above.
>>>
>>> 3. in your os.system command, you'll need that same "--universe foo" option
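>>>
>>> For example, adapting the call from your script, something like:
>>>
>>>     os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')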
>>>
>>> That may solve the problem (let me know if it does). Meantime, I'll take a
>>> look at the embedded approach without the persistent daemon...it may take
>>> me a while as I'm in the middle of something, but I will try to get to it
>>> shortly.
>>>
>>> Ralph
>>>
>>>
>>> On 7/11/07 1:40 PM, "Lev Gelb" <gelb_at_[hidden]> wrote:
>>>
>>>>
>>>> OK, I've added the debug flags - when I add them to the
>>>> os.system instance of orterun, there is no additional output,
>>>> but when I add them to the orterun instance controlling the
>>>> python program, I get the following:
>>>>
>>>>> orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
>>>> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
>>>> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
>>>> [druid.wustl.edu:18054] odls: setting up launch for job 1
>>>> [druid.wustl.edu:18054] odls: overriding oversubscription
>>>> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
>>>> set to true
>>>> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
>>>> Pypar (version 1.9.3) initialised MPI OK with 1 processors
>>>> [druid.wustl.edu:18057] OOB: Connection to HNP lost
>>>> [druid.wustl.edu:18054] odls: child process terminated
>>>> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
>>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
>>>> [0,0,0]
>>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
>>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
>>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
>>>> process [0,1,0]
>>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
>>>>
>>>> (the Pypar output is from loading that module; the next thing in
>>>> the code is the os.system call to start orterun with 2 processors.)
>>>>
>>>> Also, there is absolutely no output from the second orterun-launched
>>>> program (even the first line does not execute.)
>>>>
>>>> Cheers,
>>>>
>>>> Lev
>>>>
>>>>
>>>>
>>>>> Message: 5
>>>>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>>>>> From: Ralph H Castain <rhc_at_[hidden]>
>>>>> Subject: Re: [OMPI users] Recursive use of "orterun"
>>>>> To: "Open MPI Users <users_at_[hidden]>" <users_at_[hidden]>
>>>>> Message-ID: <C2BA8AFE.9E64%rhc_at_[hidden]>
>>>>> Content-Type: text/plain; charset="US-ASCII"
>>>>>
>>>>> I'm unaware of any issues that would cause it to fail just because it is
>>>>> being run via that interface.
>>>>>
>>>>> The error message is telling us that the procs got launched, but then
>>>>> orterun went away unexpectedly. Are you seeing your procs complete? We do
>>>>> sometimes see that message due to a race condition between the daemons
>>>>> spawned to support the application procs and orterun itself (see other
>>>>> recent notes in this forum).
>>>>>
>>>>> If your procs are not completing, then it would mean that either the
>>>>> connecting fabric is failing for some reason, or orterun is terminating
>>>>> early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
>>>>> os.system command, the output from that might help us understand why it is
>>>>> failing.
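>>>>>
>>>>> For instance, adapting your call, something like:
>>>>>
>>>>>     os.system('orterun -np 2 --debug-daemons -mca odls_base_verbose 1 nwchem.x nwchem.inp > nwchem.out')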
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 7/11/07 10:49 AM, "Lev Gelb" <gelb_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> I'm trying to port an application to Open MPI, and I'm running
>>>>>> into a problem. The program (written in Python, parallelized
>>>>>> using either "pypar" or "pyMPI") itself invokes "mpirun"
>>>>>> in order to manage external parallel processes, via something like:
>>>>>>
>>>>>> orterun -np 2 python myapp.py
>>>>>>
>>>>>> where myapp.py contains:
>>>>>>
>>>>>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
>>>>>>
>>>>>> I have this working under both LAM-MPI and MPICH on a variety
>>>>>> of different machines. However, with Open MPI, all I get is an
>>>>>> immediate return from the system call and the error:
>>>>>>
>>>>>> "OOB: Connection to HNP lost"
>>>>>>
>>>>>> I have verified that the command passed to os.system is correct,
>>>>>> and even that it runs correctly if "myapp.py" doesn't invoke any
>>>>>> MPI calls of its own.
>>>>>>
>>>>>> I'm testing Open MPI on a single box, so no machinefile setup is
>>>>>> currently active. The system is running Fedora Core 6 x86-64, and I'm
>>>>>> using the latest openmpi-1.2.3-1.src.rpm rebuilt on the machine in
>>>>>> question. I can provide additional configuration details if necessary.
>>>>>>
>>>>>> Thanks, in advance, for any help or advice,
>>>>>>
>>>>>> Lev

------------------------------------------------------------------
Lev Gelb
Associate Professor
Department of Chemistry,
Washington University in St. Louis,
St. Louis, MO 63130 USA

email: gelb_at_[hidden]
phone: (314)935-5026
fax: (314)935-4481

http://www.chemistry.wustl.edu/~gelb
------------------------------------------------------------------