Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Roberto Fichera (kernel_at_[hidden])
Date: 2008-10-03 11:11:13


Ralph Castain wrote:
>
> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>> I committed something to the trunk yesterday. Given the complexity of
>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>> sometime mid-to-end next week so it can be adequately tested.
>> Ok! So that means I can check out the SVN trunk to get your fix,
>> right?
>
> Yes, though note that I don't claim it is fully correct yet. Still
> needs testing. However, I have tested it a fair amount and it seems okay.
>
> If you do test it, please let me know how it goes.
I ran my test against the svn/trunk version below:

                Open MPI: 1.4a1r19677
   Open MPI SVN revision: r19677
   Open MPI release date: Unreleased developer copy
                Open RTE: 1.4a1r19677
   Open RTE SVN revision: r19677
   Open RTE release date: Unreleased developer copy
                    OPAL: 1.4a1r19677
       OPAL SVN revision: r19677
       OPAL release date: Unreleased developer copy
            Ident string: 1.4a1r19677
 
Below is the output, which seems to freeze just after the second spawn:

[roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons
--hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received
sync+nidmap from local proc [[19516,1],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
Loading the node's ring from file
'/var/torque/aux//932.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job
across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
Daemon [[19516,0],1] not using static ports
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running -
waiting for commands!
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
1 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
add_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon
0 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4
daemon 1 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received
sync+nidmap from local proc [[19516,2],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
collective data cmd
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
message_local_procs

Let me know if you need my test program.
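For readers following the thread: the spawn step described above (pick a host from the ring, set the working directory, launch the slave) can be sketched with MPI_Comm_spawn and an MPI_Info object carrying the "host" and "wdir" keys. This is a hypothetical reconstruction, not Roberto's actual test program; the hostname, path, and executable name are taken from the log output only for illustration.

```c
/* Minimal sketch of the spawn step described above (hypothetical,
 * not the poster's actual test program). Compile with mpicc and run
 * under mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Spawning from a multithreaded master, as in this thread,
     * requires MPI_THREAD_MULTIPLE support from the library. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided\n");

    /* Direct the spawn to one node of the ring and set its cwd. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "cluster4.tekno-soft.it");
    MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

    MPI_Comm intercomm;
    int errcode;
    /* Launch a single copy of the slave on the selected node. */
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info,
                   0 /* root */, MPI_COMM_SELF, &intercomm, &errcode);

    MPI_Info_free(&info);
    /* ... hand intercomm to a worker thread, exchange results ... */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

Repeating this from several threads concurrently, once per node in the ring, is the pattern that exercises the comm_spawn fix under discussion.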

>
> Thanks
> Ralph
>
>>
>>> Ralph
>>>
>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> Actually, it just occurred to me that you may be seeing a problem in
>>>>> comm_spawn itself that I am currently chasing down. It is in the 1.3
>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>> (instead of across all nodes). Could be related to this - you might
>>>>> want to give me a chance to complete the fix. I have identified the
>>>>> problem and should have it fixed later today in our trunk - probably
>>>>> won't move to the 1.3 branch for several days.
>>>> Do you have any news about the above fix? Is the fix already
>>>> available for testing?
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>