
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Roberto Fichera (kernel_at_[hidden])
Date: 2008-10-06 10:07:37


Ralph Castain wrote:
> Hi Roberto
>
> My time is somewhat limited, so I couldn't review the code in detail.
> However, I think I got the gist of it.
>
> A few observations:
>
> 1. the code is rather inefficient, if all you want to do is spawn a
> pattern of slave processes based on a file. Unless there is some
> overriding reason for doing this one comm_spawn at a time, it would be
> far faster to issue a single comm_spawn and just provide the hostfile
> to us. You could use either the seq or rank_file mapper - both would
> take the file and provide the outcome you seek. The only difference
> would be that the child procs would all be in the same comm_world -
> don't know if that is an issue or not.
I would agree with you if all the spawned slaves had to communicate in
the same comm_world, but that's not the case: as I already told you, each
slave will crunch different things, maybe even completely different data
from the other slaves. The job distribution will look like a tree or a
multi-tree. The main problem is that we don't know in advance how to
associate the slaves with the nodes, so we need a very dynamic
distribution of the jobs while "unrolling" the algorithm: basically, the
application needs to decide which is the best slave to run so that it can
converge locally on the solution as well as it can. That's why our
distribution is quite *unusual* ... but, I would say, legal in MPI-2
spec terms.
> 2. OMPI definitely cannot handle the threaded version of this code at
> this time - not sure when we will get to it.
Are we talking only about MPI_Comm_spawn(), or is the whole of Open MPI
not thread safe at the moment?
> 3. if you serialize the code, we -should- be able to handle it.
> However, I'm not entirely sure your current method actually does that.
> It looks like you call comm_spawn, and then create a new thread which
> then calls comm_spawn. I'm afraid I can't quite figure out how the
> thread locking would occur to prevent multiple threads continuing to
> call comm_spawn - you might want to check it again and ensure it is
> correct. Frankly, I'm not entirely sure what the thread creation is
> gaining you - as I said, we can only call comm_spawn serially, so
> having multiple threads would seem to be unnecessary...unless this
> code is incomplete and you need the threads for some other purpose.
Could you explain which part I have to serialize in order to meet Open
MPI's expectations? Can I send/receive in a multithreaded fashion, for
example?

I need threading because each thread will handle the communication with
one slave. In my code, comm_spawn() is called in a thread, and the same
thread then drives the MPI communication; when the slave's computation
terminates, the thread terminates accordingly.
> Again, you might look at that loop_spawn code I mentioned before to
> see a working example. Alternatively, if your code works under HP MPI,
> you might want to stick with it for now until we get the threading
> support up to your required level.
About your example: I see that it merges the slaves into one
intercommunicator, but what happens to that intercommunicator if one or
more slaves complete their work and I want to reuse them for doing other
things? Basically, I need to pair each slave with the data to send it for
its related computation; that's why we create a different
intercommunicator for each slave, because the slaves aren't related.
Or, at least, we decide locally whether we need more than one slave for
crunching the data; in that case the spawn is instrumented to spawn, say,
10 nodes for a single computation. So only in that case do we "fall back"
to the "standard usage" ;-)!
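For reference, the per-slave pattern described above (one MPI_Comm_spawn
and one private intercommunicator per slave, with the target host picked
via an MPI_Info key) could be sketched roughly like this. The host names
and slave command are placeholders taken from the log below, and error
handling is omitted; this is a sketch, not the actual test code:

```c
#include <mpi.h>

/* Sketch: spawn maxprocs slaves on a chosen host and return a private
 * intercommunicator for them.  "host" is the standard MPI_Info key. */
static MPI_Comm spawn_on(const char *hostname, const char *slave_cmd,
                         int maxprocs)
{
    MPI_Info info;
    MPI_Comm intercomm;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)hostname);

    MPI_Comm_spawn((char *)slave_cmd, MPI_ARGV_NULL, maxprocs, info,
                   0 /* root */, MPI_COMM_SELF, &intercomm,
                   MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    return intercomm;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One private intercommunicator per slave, so each can be
     * disconnected independently when its computation finishes;
     * the "fall back" case would simply pass maxprocs > 1. */
    MPI_Comm c1 = spawn_on("cluster4.tekno-soft.it", "./testslave.sh", 1);
    MPI_Comm c2 = spawn_on("cluster3.tekno-soft.it", "./testslave.sh", 1);

    /* ... drive each slave over its own intercomm ... */

    MPI_Comm_disconnect(&c1);
    MPI_Comm_disconnect(&c2);
    MPI_Finalize();
    return 0;
}
```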
>
> Hope that helps
> Ralph
>
> On Oct 3, 2008, at 10:36 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>> Interesting. I ran a loop calling comm_spawn 1000 times without a
>>> problem. I suspect it is the threading that is causing the trouble
>>> here.
>> I think so! My guess is that at a low level there is some trouble when
>> handling *concurrent* orted spawning.
>>> You are welcome to send me the code. You can find my loop code in your
>>> code distribution under orte/test/mpi - look for loop_spawn and
>>> loop_child.
>> In the attached code, the spawning logic currently sits in a loop in
>> the main of testmaster, so it is completely unthreaded, at least until
>> MPI_Comm_spawn() finishes its work. If you would like to test
>> multithreaded spawning, comment out the NodeThread_spawnSlave() call in
>> the main loop and uncomment the same function in
>> NodeThread_threadMain(). Finally, if you want multithreaded spawning
>> serialized by a mutex, uncomment the pthread_mutex_lock/unlock() calls
>> in NodeThread_threadMain().
>>
>> This code runs *without* any trouble in the HP MPI implementation. It
>> does not work so well with the mpich2 trunk version, due to two
>> problems: a limit of ~24.4K context ids, and/or a race in poll() while
>> waiting for a termination under MPI_Comm_disconnect() concurrently with
>> an MPI_Comm_spawn().
>>
>>>
>>> Ralph
>>>
>>> On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>>
>>>>> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> I committed something to the trunk yesterday. Given the
>>>>>>> complexity of
>>>>>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>>>>>> sometime mid-to-end next week so it can be adequately tested.
>>>>>> Ok! So it means that I can checkout from the SVN/trunk to get you
>>>>>> fix,
>>>>>> right?
>>>>>
>>>>> Yes, though note that I don't claim it is fully correct yet. Still
>>>>> needs testing. However, I have tested it a fair amount and it seems
>>>>> okay.
>>>>>
>>>>> If you do test it, please let me know how it goes.
>>>> I executed my test on the svn/trunk version below:
>>>>
>>>> Open MPI: 1.4a1r19677
>>>> Open MPI SVN revision: r19677
>>>> Open MPI release date: Unreleased developer copy
>>>> Open RTE: 1.4a1r19677
>>>> Open RTE SVN revision: r19677
>>>> Open RTE release date: Unreleased developer copy
>>>> OPAL: 1.4a1r19677
>>>> OPAL SVN revision: r19677
>>>> OPAL release date: Unreleased developer copy
>>>> Ident string: 1.4a1r19677
>>>>
>>>> Below is the output, which seems to freeze just after the second spawn.
>>>>
>>>> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons
>>>> --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000
>>>> $PBS_NODEFILE
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> add_local_procs
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master
>>>> daemon 0
>>>> arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1
>>>> daemon
>>>> INVALID arch ffc91200
>>>> Initializing MPI ...
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received
>>>> sync+nidmap from local proc [[19516,1],0]
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> message_local_procs
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> message_local_procs
>>>> Loading the node's ring from file
>>>> '/var/torque/aux//932.master.tekno-soft.it'
>>>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>>>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>>>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>>>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>>>> A 4 node's ring has been made
>>>> At least one node is available, let's start to distribute 100000 job
>>>> across 4 nodes!!!
>>>> Setting up the host as 'cluster4.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>>>> Daemon was launched on cluster4.tekno-soft.it - beginning to
>>>> initialize
>>>> Daemon [[19516,0],1] checking in as pid 25123 on host
>>>> cluster4.tekno-soft.it
>>>> Daemon [[19516,0],1] not using static ports
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running -
>>>> waiting for commands!
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> add_local_procs
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master
>>>> daemon 0
>>>> arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4
>>>> daemon
>>>> 1 arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>>>> add_local_procs
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master
>>>> daemon
>>>> 0 arch ffc91200
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4
>>>> daemon 1 arch ffc91200
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3
>>>> daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2
>>>> daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1
>>>> daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received
>>>> sync+nidmap from local proc [[19516,2],0]
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> message_local_procs
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>>>> collective data cmd
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>>>> message_local_procs
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>>>> message_local_procs
>>>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>>>> message_local_procs
>>>>
>>>> Let me know if you need my test program.
>>>>
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>>>>>
>>>>>>>> Ralph Castain wrote:
>>>>>>>>> Actually, it just occurred to me that you may be seeing a
>>>>>>>>> problem in
>>>>>>>>> comm_spawn itself that I am currently chasing down. It is in the
>>>>>>>>> 1.3
>>>>>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>>>>>> (instead of across all nodes). Could be related to this - you
>>>>>>>>> might
>>>>>>>>> want to give me a chance to complete the fix. I have
>>>>>>>>> identified the
>>>>>>>>> problem and should have it fixed later today in our trunk -
>>>>>>>>> probably
>>>>>>>>> won't move to the 1.3 branch for several days.
>>>>>>>> Do you have any news about the above fix? Is the fix already
>>>>>>>> available for testing?
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>> <testspawn.tar.bz2>
>
>