
From: Ralph Castain (rhc_at_[hidden])
Date: 2007-07-11 18:10:40
Subject: Re: [OMPI users] Recursive use of "orterun"


Hmmm...interesting. As a cross-check on something - when you call os.system,
does your environment by any chance get copied across? Reason I ask: we set a
number of environment variables when orterun spawns a process. If you call
orterun from within that process - and the new orterun sees the environment
variables from the parent process - then I can guarantee it won't work.

What you need is for os.system to start its child with a clean environment.
I would imagine that if you just os.system'd a command that prints out the
environment, that would suffice to identify the problem. If you see anything
that starts with OMPI_MCA_..., then we are indeed doomed.
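
Something along these lines would show it (just a sketch - I'm assuming the
OMPI_ prefix catches the variables we set, and I'm reusing your nwchem command
as the example):

  import os, subprocess

  # quick check: does the child of os.system see our variables?
  os.system('env | grep OMPI_')

  # if it does, one way to launch the inner orterun with them stripped:
  clean_env = dict((k, v) for k, v in os.environ.items()
                   if not k.startswith('OMPI_'))
  subprocess.call('orterun -np 2 nwchem.x nwchem.inp > nwchem.out',
                  shell=True, env=clean_env)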

Which would definitely explain why the persistent orted wouldn't help solve
the problem.

Ralph

On 7/11/07 3:05 PM, "Lev Gelb" <gelb_at_[hidden]> wrote:

>
> Thanks for the suggestions. The separate 'orted' scheme (below) did
> not work, unfortunately; same behavior as before. I have conducted
> a few other simple tests, and found:
>
> 1. The problem only occurs if the first process is "in" MPI;
> if it doesn't call MPI_Init or calls MPI_Finalize before it executes
> the second orterun, everything works.
>
> 2. Whether or not the second process actually uses MPI doesn't matter.
>
> 3. Using the standalone orted in "debug" mode with "universe"
> specified throughout, there does not appear to be any communication to
> orted upon the second invocation of orterun.
>
> (Also, I've been able to get nested orteruns working using simple shell
> scripts, but these don't call MPI_Init.)
>
> Cheers,
>
> Lev
>
>
>
> On Wed, 11 Jul 2007, Ralph H Castain wrote:
>
>> Hmmm...well, what that indicates is that your application program is losing
>> the connection to orterun, but that orterun is still alive and kicking (it
>> is alive enough to send the [0,0,1] daemon a message ordering it to exit).
>> So the question is: why is your application program dropping the connection?
>>
>> I haven't tried doing embedded orterun commands, so there could be a
>> conflict there that causes the OOB connection to drop. Best guess is that
>> there is confusion over which orterun it is supposed to connect to. I can
>> give it a try and see - this may not be a mode we can support.
>>
>> Alternatively, you could start a persistent daemon and then just allow both
>> orterun instances to report to it. Our method for doing that isn't as
>> convenient as we would like (we hope to improve it soon), but it does work.
>> What you have to do is:
>>
>> 1. to start the persistent daemon, type:
>>
>> "orted --seed --persistent --scope public --universe foo"
>>
>> where foo can be whatever name you like.
>>
>> 2. when you execute your application, use:
>>
>> orterun -np 1 --universe foo python ./test.py
>>
>> where the "foo" matches the name given above.
>>
>> 3. in your os.system command, you'll need that same "--universe foo" option
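>>
>> For example, the os.system line would become something like (sketch only -
>> substitute your actual command and universe name):
>>
>>   os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')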
>>
>> That may solve the problem (let me know if it does). Meantime, I'll take a
>> look at the embedded approach without the persistent daemon...may take me
>> awhile as I'm in the middle of something, but I will try to get to it
>> shortly.
>>
>> Ralph
>>
>>
>> On 7/11/07 1:40 PM, "Lev Gelb" <gelb_at_[hidden]> wrote:
>>
>>>
>>> OK, I've added the debug flags - when I add them to the
>>> os.system instance of orterun, there is no additional output,
>>> but when I add them to the orterun instance controlling the
>>> python program, I get the following:
>>>
>>>> orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
>>> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
>>> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
>>> [druid.wustl.edu:18054] odls: setting up launch for job 1
>>> [druid.wustl.edu:18054] odls: overriding oversubscription
>>> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
>>> set to true
>>> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
>>> Pypar (version 1.9.3) initialised MPI OK with 1 processors
>>> [druid.wustl.edu:18057] OOB: Connection to HNP lost
>>> [druid.wustl.edu:18054] odls: child process terminated
>>> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
>>> [0,0,0]
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
>>> process [0,1,0]
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
>>>
>>> (the Pypar output is from loading that module; the next thing in
>>> the code is the os.system call to start orterun with 2 processors.)
>>>
>>> Also, there is absolutely no output from the second orterun-launched
>>> program (even the first line does not execute.)
>>>
>>> Cheers,
>>>
>>> Lev
>>>
>>>
>>>
>>>> Message: 5
>>>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>>>> From: Ralph H Castain <rhc_at_[hidden]>
>>>> Subject: Re: [OMPI users] Recursive use of "orterun"
>>>> To: "Open MPI Users <users_at_[hidden]>" <users_at_[hidden]>
>>>> Message-ID: <C2BA8AFE.9E64%rhc_at_[hidden]>
>>>> Content-Type: text/plain; charset="US-ASCII"
>>>>
>>>> I'm unaware of any issues that would cause it to fail just because it is
>>>> being run via that interface.
>>>>
>>>> The error message is telling us that the procs got launched, but then
>>>> orterun went away unexpectedly. Are you seeing your procs complete? We do
>>>> sometimes see that message due to a race condition between the daemons
>>>> spawned to support the application procs and orterun itself (see other
>>>> recent notes in this forum).
>>>>
>>>> If your procs are not completing, then it would mean that either the
>>>> connecting fabric is failing for some reason, or orterun is terminating
>>>> early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
>>>> os.system command, the output from that might help us understand why it is
>>>> failing.
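>>>>
>>>> For instance (a sketch, reusing the command from your message), the line
>>>> in your script would become:
>>>>
>>>>   os.system('orterun -np 2 --debug-daemons -mca odls_base_verbose 1 nwchem.x nwchem.inp > nwchem.out')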
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 7/11/07 10:49 AM, "Lev Gelb" <gelb_at_[hidden]> wrote:
>>>>
>>>>>
>>>>> Hi -
>>>>>
>>>>> I'm trying to port an application to use Open MPI, and running
>>>>> into a problem. The program (written in Python, parallelized
>>>>> using either "pypar" or "pyMPI") itself invokes "mpirun"
>>>>> in order to manage external, parallel processes, via something like:
>>>>>
>>>>> orterun -np 2 python myapp.py
>>>>>
>>>>> where myapp.py contains:
>>>>>
>>>>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
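>>>>>
>>>>> (More completely, a stripped-down myapp.py looks roughly like:
>>>>>
>>>>>   import os
>>>>>   import pypar   # loading pypar is what initialises MPI
>>>>>
>>>>>   os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
>>>>>
>>>>> with the rest of the pypar calls around it.)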
>>>>>
>>>>> I have this working under both LAM-MPI and MPICH on a variety
>>>>> of different machines. However, with Open MPI, all I get is an
>>>>> immediate return from the system call and the error:
>>>>>
>>>>> "OOB: Connection to HNP lost"
>>>>>
>>>>> I have verified that the command passed to os.system is correct,
>>>>> and even that it runs correctly if "myapp.py" doesn't invoke any
>>>>> MPI calls of its own.
>>>>>
>>>>> I'm testing Open MPI on a single box, so no machinefile is currently
>>>>> in use. The system is running Fedora Core 6 x86-64, and I'm using the
>>>>> latest openmpi-1.2.3-1.src.rpm rebuilt on the machine in question.
>>>>> I can provide additional configuration details if necessary.
>>>>>
>>>>> Thanks, in advance, for any help or advice,
>>>>>
>>>>> Lev
>>>>>
>>>>>
>>
>
> ------------------------------------------------------------------
> Lev Gelb
> Associate Professor
> Department of Chemistry,
> Washington University in St. Louis,
> St. Louis, MO 63130 USA
>
> email: gelb_at_[hidden]
> phone: (314)935-5026
> fax: (314)935-4481
>
> http://www.chemistry.wustl.edu/~gelb
> ------------------------------------------------------------------
>