Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-03 08:54:34


I committed something to the trunk yesterday. Given the complexity of
the fix, I don't plan to bring it over to the 1.3 branch until
sometime mid-to-late next week, so it can be adequately tested.

Ralph

On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:

> Ralph Castain wrote:
>> Actually, it just occurred to me that you may be seeing a problem in
>> comm_spawn itself that I am currently chasing down. It is in the 1.3
>> branch and has to do with comm_spawning procs on subsets of nodes
>> (instead of across all nodes). Could be related to this - you might
>> want to give me a chance to complete the fix. I have identified the
>> problem and should have it fixed later today in our trunk - probably
>> won't move to the 1.3 branch for several days.
> Do you have any news about the above fix? Is it already
> available for testing?
>>
>> Ralph
>>
>> On Oct 1, 2008, at 10:43 AM, Ralph Castain wrote:
>>
>>> Afraid I am somewhat at a loss. The logs indicate that mpirun itself
>>> is having problems, likely caused by the threading. Only thing I can
>>> suggest is that you "unthread" the spawning loop and try it that way
>>> first so we can see if some underlying problem exists.
>>>
>>> FWIW: I have run a loop over calls to comm_spawn without problems.
>>> However, there are system limits to the number of child processes an
>>> orted can create. You may hit those at some point - we try to report
>>> that as a separate error when we see it, but it isn't always easy to
>>> catch.
>>>
>>> Like I said, we really don't support threaded operations like this
>>> right now, so I have no idea what your app may be triggering. I
>>> would
>>> definitely try it "unthreaded" if possible.
>>>
>>> Ralph
>>>
>>>
>>> On Oct 1, 2008, at 9:04 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> Okay, I believe I understand the problem. What this error is
>>>>> telling
>>>>> you is that the Torque MOM is refusing our connection request
>>>>> because
>>>>> it is already busy. So we cannot spawn another process.
>>>>>
>>>>> If I understand your application correctly, you are spinning off
>>>>> multiple threads, each attempting to comm_spawn a single process -
>>>>> true? The problem with that design is that - since OMPI is not
>>>>> thread
>>>>> safe yet - these threads are all attempting to connect to the
>>>>> MOM at
>>>>> the same time. The MOM will only allow one connection at a time,
>>>>> and
>>>>> so at some point we are requesting a connection while already
>>>>> connected.
>>>>>
>>>>> Since we are some ways off from attaining thread safety in these
>>>>> scenarios, you really have three choices:
>>>>>
>>>>> 1. you could do this with a single comm_spawn call. Remember,
>>>>> you can
>>>>> provide an MPI_Info key to comm_spawn essentially telling it
>>>>> where to
>>>>> place the various process ranks. Unless you truly want each new
>>>>> process to be in its own comm_world, there is no real need to do
>>>>> this
>>>>> with 10000 individual calls to comm_spawn.
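>>>>>
>>>>> A minimal sketch of that single-call approach (the "host" and
>>>>> "wdir" info keys are reserved by the MPI standard; treating the
>>>>> host value as a comma-separated placement list is an assumption
>>>>> about what the runtime accepts):
>>>>>
>>>>>   MPI_Comm slaves;
>>>>>   MPI_Info info;
>>>>>   MPI_Info_create(&info);
>>>>>   /* hypothetical host list - one rank per listed node */
>>>>>   MPI_Info_set(info, "host", "cluster1,cluster2,cluster3,cluster4");
>>>>>   MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");
>>>>>   /* one call spawns all slaves into a single intercommunicator */
>>>>>   MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 4, info, 0,
>>>>>                  MPI_COMM_SELF, &slaves, MPI_ERRCODES_IGNORE);
>>>>>   MPI_Info_free(&info);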
>>>> I only need master-to-slave communication; the slaves *don't* need
>>>> to communicate with each other. The logic within the test program
>>>> is quite simple: it dispatches as many jobs as the user requests
>>>> across the assigned nodes, trying to keep them as busy as possible.
>>>> That's because our algorithms need a tree evolution, where a node
>>>> is master of a bunch of slaves and a slave can in turn be a
>>>> sub-master of a bunch of slaves, depending on how each leaf evolves
>>>> in its computation. Generally we don't go deeper than 5 or 6
>>>> levels, but we need very dynamic logic for dispatching jobs.
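>>>>
>>>> A slave promoting itself to sub-master would look roughly like this
>>>> (just a sketch; needs_children() and nkids stand in for our real
>>>> decision logic):
>>>>
>>>>   MPI_Comm parent, children;
>>>>   MPI_Comm_get_parent(&parent);   /* intercomm back to our master */
>>>>   /* ... receive work from the master over 'parent' ... */
>>>>   if (needs_children()) {         /* hypothetical decision point */
>>>>       MPI_Comm_spawn("testslave", MPI_ARGV_NULL, nkids,
>>>>                      MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>                      &children, MPI_ERRCODES_IGNORE);
>>>>   }
>>>>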
>>>>> 2. you could execute your own thread locking scheme in your
>>>>> application so that only one thread calls comm_spawn at a time.
>>>> I did it, both with and without _tm_ support, using a mutex to
>>>> serialize the MPI_Comm_spawn() calls.
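>>>> The serialization is essentially this (a rough sketch; the
>>>> variable names are illustrative, not the real code):
>>>>
>>>>   static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;
>>>>
>>>>   /* inside each dispatcher thread */
>>>>   pthread_mutex_lock(&spawn_lock);
>>>>   MPI_Comm_spawn(task, MPI_ARGV_NULL, 1, info, 0,
>>>>                  MPI_COMM_SELF, &slave, MPI_ERRCODES_IGNORE);
>>>>   pthread_mutex_unlock(&spawn_lock);
>>>>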
>>>> The log below is with the torque/pbs support compiled in:
>>>>
>>>> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received add_local_procs
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[0].name master daemon 0 arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>>>> Initializing MPI ...
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_recv: received sync+nidmap from local proc [[10231,1],0]
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received message_local_procs
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received message_local_procs
>>>> Loading the node's ring from file '/var/torque/aux//929.master.tekno-soft.it'
>>>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>>>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>>>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>>>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>>>> A 4 node's ring has been made
>>>> At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
>>>> ****************** Starting job #1
>>>> ****************** Starting job #2
>>>> ****************** Starting job #3
>>>> ****************** Starting job #4
>>>> Setting up the host as 'cluster4.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster3.tekno-soft.it'
>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster2.tekno-soft.it'
>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster1.tekno-soft.it'
>>>> Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
>>>> Daemon [[10231,0],1] checking in as pid 4869 on host cluster4.tekno-soft.it
>>>> Daemon [[10231,0],1] not using static ports
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted: up and running - waiting for commands!
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received add_local_procs
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[0].name master daemon 0 arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[1].name cluster4 daemon 1 arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:07844] [[10231,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received add_local_procs
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[0].name master daemon 0 arch ffc91200
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[1].name cluster4 daemon 1 arch ffc91200
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_recv: received sync+nidmap from local proc [[10231,2],0]
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received message_local_procs
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received message_local_procs
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received message_local_procs
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received message_local_procs
>>>> Killed
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] routed:binomial: Connection to lifeline [[10231,0],0] lost
>>>> [cluster4.tekno-soft.it:04869] [[10231,0],1] routed:binomial: Connection to lifeline [[10231,0],0] lost
>>>> [roberto_at_master TestOpenMPI]$
>>>>
>>>> This one is *without-tm*:
>>>>
>>>> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received add_local_procs
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[0].name master daemon 0 arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>>>> Initializing MPI ...
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_recv: received sync+nidmap from local proc [[23396,1],0]
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received message_local_procs
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received message_local_procs
>>>> Loading the node's ring from file '/var/torque/aux//928.master.tekno-soft.it'
>>>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>>>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>>>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>>>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>>>> A 4 node's ring has been made
>>>> At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
>>>> ****************** Starting job #1
>>>> ****************** Starting job #2
>>>> ****************** Starting job #3
>>>> ****************** Starting job #4
>>>> Setting up the host as 'cluster4.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster3.tekno-soft.it'
>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster2.tekno-soft.it'
>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave.sh' on node 'cluster1.tekno-soft.it'
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received add_local_procs
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[0].name master daemon 0 arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[1].name cluster4 daemon 1 arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>>>> [master.tekno-soft.it:25143] [[23396,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>>>> Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
>>>> Daemon [[23396,0],1] checking in as pid 3653 on host cluster4.tekno-soft.it
>>>> Daemon [[23396,0],1] not using static ports
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted: up and running - waiting for commands!
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received add_local_procs
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[0].name master daemon 0 arch ffc91200
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[1].name cluster4 daemon 1 arch ffc91200
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_recv: received sync+nidmap from local proc [[23396,2],0]
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received message_local_procs
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received message_local_procs
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received collective data cmd
>>>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received message_local_procs
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received message_local_procs
>>>>
>>>> [... got a freeze here ... then ^C ...]
>>>>
>>>> mpirun: killing job...
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> mpirun noticed that process rank 0 with PID 25150 on node master.tekno-soft.it exited on signal 0 (Unknown signal 0).
>>>> --------------------------------------------------------------------------
>>>>
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received exit
>>>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted: finalizing
>>>> mpirun: clean termination accomplished
>>>>
>>>> [cluster4:03653] *** Process received signal ***
>>>> [cluster4:03653] Signal: Segmentation fault (11)
>>>> [cluster4:03653] Signal code: Address not mapped (1)
>>>> [cluster4:03653] Failing at address: 0x2aaaab784af0
>>>>
>>>> So it seems that we have problems in some other places as well;
>>>> maybe some other functions are not thread safe.
>>>>
>>>>> 3. remove the threaded launch scenario and just call comm_spawn
>>>>> in a
>>>>> loop.
>>>>>
>>>>> In truth, the threaded approach to spawning all these procs isn't
>>>>> gaining you anything. Torque will only do one launch at a time
>>>>> anyway,
>>>>> so you will launch them serially no matter what. You may just be
>>>>> adding complexity for no real net gain.
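>>>>>
>>>>> A serial version of the dispatch loop would be roughly (sketch;
>>>>> next_free_node() stands in for whatever bookkeeping picks the
>>>>> next idle host):
>>>>>
>>>>>   for (int job = 0; job < njobs; job++) {
>>>>>       MPI_Comm slave;
>>>>>       MPI_Info info;
>>>>>       MPI_Info_create(&info);
>>>>>       MPI_Info_set(info, "host", next_free_node()); /* hypothetical */
>>>>>       MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, info, 0,
>>>>>                      MPI_COMM_SELF, &slave, MPI_ERRCODES_IGNORE);
>>>>>       MPI_Info_free(&info);
>>>>>       /* ... exchange data, then MPI_Comm_disconnect(&slave) ... */
>>>>>   }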
>>>> Regarding torque/pbs/maui, that's ok! It doesn't handle multiple
>>>> spawns at the same time. But in general, if I don't use _tm_ any
>>>> more, I guess we could gain something from executing the spawns in
>>>> parallel, because the spawning would be done via ssh/rsh.
>>>>>
>>>>> Ralph
>>>>>
>>>>> On Oct 1, 2008, at 1:56 AM, Roberto Fichera wrote:
>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Hi Roberto
>>>>>>>
>>>>>>> There is something wrong with this cmd line - perhaps it wasn't
>>>>>>> copied
>>>>>>> correctly?
>>>>>>>
>>>>>>> mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd` testmaster 10000 $PBS_NODEFILE
>>>>>>>
>>>>>>> Specifically, the following is incomplete: --mca obl
>>>>>>>
>>>>>>> I'm not sure if this is the problem or not, but I am unaware
>>>>>>> of such
>>>>>>> an option and believe it could cause mpirun to become confused.
>>>>>> Oops! Sorry, I copied the wrong log; below is the right one:
>>>>>>
>>>>>> [roberto_at_master TestOpenMPI]$ qsub -I testmaster.pbs
>>>>>> qsub: waiting for job 920.master.tekno-soft.it to start
>>>>>> qsub: job 920.master.tekno-soft.it ready
>>>>>>
>>>>>> [roberto_at_master TestMPICH2]$ cd /data/roberto/MPI/TestOpenMPI/
>>>>>> [roberto_at_master TestOpenMPI]$ mpirun --debug-daemons --mca btl tcp,self -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received add_local_procs
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[0].name master daemon 0 arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>>>>>> Initializing MPI ...
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_recv: received sync+nidmap from local proc [[11340,1],0]
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received collective data cmd
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received message_local_procs
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received collective data cmd
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received message_local_procs
>>>>>> Loading the node's ring from file '/var/torque/aux//920.master.tekno-soft.it'
>>>>>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>>>>>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>>>>>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>>>>>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>>>>>> A 4 node's ring has been made
>>>>>> At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
>>>>>> ****************** Starting job #1
>>>>>> ****************** Starting job #2
>>>>>> ****************** Starting job #3
>>>>>> Setting up the host as 'cluster4.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task 'testslave' on node 'cluster4.tekno-soft.it'
>>>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task 'testslave' on node 'cluster3.tekno-soft.it'
>>>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task 'testslave' on node 'cluster2.tekno-soft.it'
>>>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task 'testslave' on node 'cluster1.tekno-soft.it'
>>>>>> ****************** Starting job #4
>>>>>> Daemon was launched on cluster3.tekno-soft.it - beginning to initialize
>>>>>> Daemon [[11340,0],1] checking in as pid 9487 on host cluster3.tekno-soft.it
>>>>>> Daemon [[11340,0],1] not using static ports
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
>>>>>> [master.tekno-soft.it:05414] [[11340,1],0] ORTE_ERROR_LOG: The specified application failed to start in file dpm_orte.c at line 677
>>>>>> [master.tekno-soft.it:05414] *** An error occurred in MPI_Comm_spawn
>>>>>> [master.tekno-soft.it:05414] *** on communicator MPI_COMM_WORLD
>>>>>> [master.tekno-soft.it:05414] *** MPI_ERR_SPAWN: could not spawn processes
>>>>>> [master.tekno-soft.it:05414] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>>>>
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received add_local_procs
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[0].name master daemon 0 arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[2].name cluster3 daemon 1 arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>>>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>>>>>> [cluster3.tekno-soft.it:09487] [[11340,0],1] orted: up and running - waiting for commands!
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>> On Sep 30, 2008, at 8:24 AM, Roberto Fichera wrote:
>>>>>>>
>>>>>>>> Roberto Fichera wrote:
>>>>>>>>> Hi All on the list,
>>>>>>>>>
>>>>>>>>> I'm trying to run dynamic MPI applications using
>>>>>>>>> MPI_Comm_spawn(). The application I'm using for tests is
>>>>>>>>> basically composed of a master, which spawns a slave on each
>>>>>>>>> assigned node in a multithreaded fashion. The master is
>>>>>>>>> started with a number of jobs to perform and a filename
>>>>>>>>> containing the list of assigned nodes. The idea is to handle
>>>>>>>>> all the dispatching logic within the application, so that the
>>>>>>>>> master tries to keep each assigned node as busy as possible.
>>>>>>>>> That said, for each spawned job the master allocates a thread
>>>>>>>>> for spawning and handling the communication, then generates a
>>>>>>>>> random number and sends it to the slave, which simply sends
>>>>>>>>> it back to the master. Finally the slave terminates its job
>>>>>>>>> and the corresponding node becomes free for a new one. This
>>>>>>>>> continues until all the requested jobs are done.
>>>>>>>>>
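>>>>>>>>> Per job, the master/slave exchange is essentially this (a
>>>>>>>>> sketch; error handling omitted):
>>>>>>>>>
>>>>>>>>>   int value = rand(), echoed;
>>>>>>>>>   /* 'slave' is the intercomm returned by MPI_Comm_spawn() */
>>>>>>>>>   MPI_Send(&value, 1, MPI_INT, 0, 0, slave);
>>>>>>>>>   MPI_Recv(&echoed, 1, MPI_INT, 0, 0, slave, MPI_STATUS_IGNORE);
>>>>>>>>>   MPI_Comm_disconnect(&slave);
>>>>>>>>>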
>>>>>>>>> The test program I'm using *doesn't* work flawlessly in
>>>>>>>>> mpich2, because mpich2 has a ~24k spawned-job limitation: its
>>>>>>>>> internal context id increases monotonically, which eventually
>>>>>>>>> stops the application with an internal library overflow. The
>>>>>>>>> context ids allocated for terminated spawned jobs are never
>>>>>>>>> recycled at the moment. The only MPI-2 implementation
>>>>>>>>> supporting MPI_Comm_spawn() that was able to complete the
>>>>>>>>> test is currently HP MPI. So now I want to check whether
>>>>>>>>> OpenMPI is suitable for our dynamic parallel applications.
>>>>>>>>>
>>>>>>>>> The test application is linked against OpenMPI v1.3a1r19645,
>>>>>>>>> running on Fedora 8 x86_64 + all updates.
>>>>>>>>>
>>>>>>>>> My first attempt ended in the error below, and I basically
>>>>>>>>> don't know where to look further. Note that I've already
>>>>>>>>> checked PATHs and LD_LIBRARY_PATH; the application is
>>>>>>>>> basically configured correctly, since it is started through
>>>>>>>>> two scripts and all the paths are set there. Basically I need
>>>>>>>>> to start *one* master application which will handle
>>>>>>>>> everything needed to manage the slave applications. The
>>>>>>>>> communication is *only* master <-> slave and never
>>>>>>>>> collective, at the moment.
>>>>>>>>>
>>>>>>>>> The test program is available on request.
>>>>>>>>>
>>>>>>>>> Does any one have an idea what's going on?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Roberto Fichera.
>>>>>>>>>
>>>>>>>>> [roberto_at_cluster4 TestOpenMPI]$ orterun -wdir /data/roberto/MPI/TestOpenMPI -np 1 testmaster 10000 $PBS_NODEFILE
>>>>>>>>> Initializing MPI ...
>>>>>>>>> Loading the node's ring from file '/var/torque/aux//909.master.tekno-soft.it'
>>>>>>>>> ... adding node #1 host is 'cluster3.tekno-soft.it'
>>>>>>>>> ... adding node #2 host is 'cluster2.tekno-soft.it'
>>>>>>>>> ... adding node #3 host is 'cluster1.tekno-soft.it'
>>>>>>>>> ... adding node #4 host is 'master.tekno-soft.it'
>>>>>>>>> A 4 node's ring has been made
>>>>>>>>> At least one node is available, let's start to distribute 10000 job across 4 nodes!!!
>>>>>>>>> ****************** Starting job #1
>>>>>>>>> ****************** Starting job #2
>>>>>>>>> ****************** Starting job #3
>>>>>>>>> ****************** Starting job #4
>>>>>>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>>>> Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
>>>>>>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>>>> Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
>>>>>>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>>>> Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
>>>>>>>>> Setting up the host as 'master.tekno-soft.it'
>>>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>>>> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>>>>> launch so we are aborting.
>>>>>>>>>
>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>
>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
>>>>>>>>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
>>>>>>>>>