
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-03 12:13:48


Interesting. I ran a loop calling comm_spawn 1000 times without a
problem. I suspect it is the threading that is causing the trouble here.

You are welcome to send me the code. You can find my loop code in your
code distribution under orte/test/mpi - look for loop_spawn and
loop_child.
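
For reference, a minimal sketch of such a spawn loop (assuming a working MPI installation; this is not the actual orte/test/mpi/loop_spawn source, and the "./loop_child" executable name is hypothetical) might look like:

```c
/*
 * Minimal sketch of a comm_spawn loop, assuming a working MPI
 * installation. This is NOT the actual orte/test/mpi/loop_spawn
 * source; the "./loop_child" executable name is hypothetical.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    int i, provided;

    /* Request full thread support, since the reported hang involves
       calling comm_spawn from a multithreaded environment. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    for (i = 0; i < 1000; i++) {
        /* Spawn one child per iteration from this process alone. */
        MPI_Comm_spawn("./loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        /* Disconnect so the child can finalize and exit on its own. */
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}
```

A threaded variant would run this loop body concurrently from several threads, which is where the reported hang appears.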

Ralph

On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote:

> Ralph Castain wrote:
>>
>> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>>
>>> Ralph Castain wrote:
>>>> I committed something to the trunk yesterday. Given the
>>>> complexity of
>>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>>> sometime mid-to-end next week so it can be adequately tested.
>>> OK! So that means I can check out the SVN trunk to get your fix,
>>> right?
>>
>> Yes, though note that I don't claim it is fully correct yet. Still
>> needs testing. However, I have tested it a fair amount and it seems
>> okay.
>>
>> If you do test it, please let me know how it goes.
> I executed my test on the SVN trunk build below:
>
> Open MPI: 1.4a1r19677
> Open MPI SVN revision: r19677
> Open MPI release date: Unreleased developer copy
> Open RTE: 1.4a1r19677
> Open RTE SVN revision: r19677
> Open RTE release date: Unreleased developer copy
> OPAL: 1.4a1r19677
> OPAL SVN revision: r19677
> OPAL release date: Unreleased developer copy
> Ident string: 1.4a1r19677
>
> Below is the output, which seems to freeze just after the second spawn.
>
> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
> Initializing MPI ...
> [master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received sync+nidmap from local proc [[19516,1],0]
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
> Loading the node's ring from file '/var/torque/aux//932.master.tekno-soft.it'
> ... adding node #1 host is 'cluster4.tekno-soft.it'
> ... adding node #2 host is 'cluster3.tekno-soft.it'
> ... adding node #3 host is 'cluster2.tekno-soft.it'
> ... adding node #4 host is 'cluster1.tekno-soft.it'
> A 4 node's ring has been made
> At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
> Setting up the host as 'cluster4.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
> Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
> Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
> Daemon [[19516,0],1] not using static ports
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running - waiting for commands!
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon 1 arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received add_local_procs
> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon 0 arch ffc91200
> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4 daemon 1 arch ffc91200
> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received sync+nidmap from local proc [[19516,2],0]
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
>
> Let me know if you need my test program.
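
For reference, the spawn pattern suggested by the output above (a task launched on a named host with a given working directory) can be sketched with Open MPI's "host" and "wdir" MPI_Info keys. The hostname, path, and "testslave.sh" name are taken from the log; everything else is an illustrative assumption about the test program, not its actual source:

```c
/*
 * Illustrative sketch only: spawning a task on a named host with a
 * given working directory via Open MPI's "host" and "wdir" MPI_Info
 * keys. Hostname, path, and "testslave.sh" come from the log above;
 * the rest is an assumption about the test program.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* Direct the spawn to a specific node and working directory. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "cluster4.tekno-soft.it");
    MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

    /* Launch one instance of the slave script on the chosen node. */
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}
```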
>
>>
>> Thanks
>> Ralph
>>
>>>
>>>> Ralph
>>>>
>>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>>
>>>>> Ralph Castain wrote:
>>>>>> Actually, it just occurred to me that you may be seeing a
>>>>>> problem in
>>>>>> comm_spawn itself that I am currently chasing down. It is in
>>>>>> the 1.3
>>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>>> (instead of across all nodes). Could be related to this - you
>>>>>> might
>>>>>> want to give me a chance to complete the fix. I have identified
>>>>>> the
>>>>>> problem and should have it fixed later today in our trunk -
>>>>>> probably
>>>>>> won't move to the 1.3 branch for several days.
>>>>> Do you have any news about the above fix? Is it already
>>>>> available for testing?
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>