
Open MPI User's Mailing List Archives

From: Lev Gelb (gelb_at_[hidden])
Date: 2007-07-11 15:40:26


OK, I've added the debug flags. When I add them to the os.system
invocation of orterun there is no additional output, but when I add
them to the orterun instance controlling the Python program, I get
the following:

>orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor set to true
[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from [0,0,0]
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child process [0,1,0]
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even its first line does not execute).
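For anyone trying to reproduce this, the structure of the script can be sketched as below. This is a hypothetical reconstruction, not the actual test.py (which isn't shown in full); the helper names and the dry_run flag are illustrative additions, while the orterun command line and the nwchem.x/nwchem.inp names come from the original report.

```python
import os

def inner_command(np=2):
    # The nested orterun invocation handed to os.system;
    # nwchem.x / nwchem.inp are the names from the original post.
    return "orterun -np %d nwchem.x nwchem.inp > nwchem.out" % np

def run_inner(np=2, dry_run=True):
    # dry_run is an illustrative switch: return the command string
    # instead of actually launching a nested orterun.
    cmd = inner_command(np)
    if dry_run:
        return cmd
    return os.system(cmd)  # the real nested launch, as in the script

# import pypar  # would initialise MPI before the nested launch;
#               # commented out so this sketch runs standalone
print(run_inner())
```

The outer launch would then be `orterun -np 1 python ./test.py`, matching the command shown above.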

Cheers,

Lev

> Message: 5
> Date: Wed, 11 Jul 2007 13:26:22 -0600
> From: Ralph H Castain <rhc_at_[hidden]>
> Subject: Re: [OMPI users] Recursive use of "orterun"
> To: "Open MPI Users <users_at_[hidden]>" <users_at_[hidden]>
> Message-ID: <C2BA8AFE.9E64%rhc_at_[hidden]>
> Content-Type: text/plain; charset="US-ASCII"
>
> I'm unaware of any issues that would cause it to fail just because it is
> being run via that interface.
>
> The error message is telling us that the procs got launched, but then
> orterun went away unexpectedly. Are you seeing your procs complete? We do
> sometimes see that message due to a race condition between the daemons
> spawned to support the application procs and orterun itself (see other
> recent notes in this forum).
>
> If your procs are not completing, then it would mean that either the
> connecting fabric is failing for some reason, or orterun is terminating
> early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
> os.system command, the output from that might help us understand why it is
> failing.
>
> Ralph
>
>
>
> On 7/11/07 10:49 AM, "Lev Gelb" <gelb_at_[hidden]> wrote:
>
>>
>> Hi -
>>
>> I'm trying to port an application to Open MPI, and I'm running
>> into a problem. The program (written in Python, parallelized
>> using either "pypar" or "pyMPI") itself invokes "mpirun"
>> to manage external, parallel processes, via something like:
>>
>> orterun -np 2 python myapp.py
>>
>> where myapp.py contains:
>>
>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
>>
>> I have this working under both LAM/MPI and MPICH on a variety
>> of different machines. However, with Open MPI, all I get is an
>> immediate return from the system call and the error:
>>
>> "OOB: Connection to HNP lost"
>>
>> I have verified that the command passed to os.system is correct,
>> and even that it runs correctly if "myapp.py" doesn't invoke any
>> MPI calls of its own.
>>
>> I'm testing Open MPI on a single box, so there's no machinefile
>> configuration currently active. The system is running Fedora Core 6
>> x86-64, and I'm using the latest openmpi-1.2.3-1.src.rpm rebuilt on
>> the machine in question. I can provide additional configuration
>> details if necessary.
>>
>> Thanks, in advance, for any help or advice,
>>
>> Lev
>>
>>
>> ------------------------------------------------------------------
>> Lev Gelb
>> Associate Professor, Department of Chemistry
>> Washington University in St. Louis, St. Louis, MO 63130 USA
>>
>> email: gelb_at_[hidden]
>> phone: (314)935-5026 fax: (314)935-4481
>>
>> http://www.chemistry.wustl.edu/~gelb
>> ------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users