Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-01 13:00:13


Actually, it just occurred to me that you may be seeing a problem in
comm_spawn itself that I am currently chasing down. It is in the 1.3
branch and has to do with comm_spawning procs on subsets of nodes
(instead of across all nodes). It could be related to what you are
seeing - you might want to give me a chance to complete the fix. I
have identified the problem and should have it fixed later today in
our trunk; it probably won't move to the 1.3 branch for several days.

Ralph

On Oct 1, 2008, at 10:43 AM, Ralph Castain wrote:

> Afraid I am somewhat at a loss. The logs indicate that mpirun itself
> is having problems, likely caused by the threading. Only thing I can
> suggest is that you "unthread" the spawning loop and try it that way
> first so we can see if some underlying problem exists.
>
> FWIW: I have run a loop over calls to comm_spawn without problems.
> However, there are system limits to the number of child processes an
> orted can create. You may hit those at some point - we try to report
> that as a separate error when we see it, but it isn't always easy to
> catch.
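>
> For reference, a minimal unthreaded loop looks something like the
> sketch below (untested, error handling omitted; "testslave" and the
> job count are placeholders):
>
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Init(&argc, &argv);
>       for (int i = 0; i < 100; i++) {      /* placeholder job count */
>           MPI_Comm child;
>           int err[1];
>           /* spawn one slave at a time - no thread locking needed */
>           MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>                          0, MPI_COMM_SELF, &child, err);
>           /* ... talk to the child over the intercomm here ... */
>           MPI_Comm_disconnect(&child);
>       }
>       MPI_Finalize();
>       return 0;
>   }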
>
> Like I said, we really don't support threaded operations like this
> right now, so I have no idea what your app may be triggering. I
> would definitely try it "unthreaded" if possible.
>
> Ralph
>
>
> On Oct 1, 2008, at 9:04 AM, Roberto Fichera wrote:
>
>> Ralph Castain ha scritto:
>>> Okay, I believe I understand the problem. What this error is
>>> telling you is that the Torque MOM is refusing our connection
>>> request because it is already busy. So we cannot spawn another
>>> process.
>>>
>>> If I understand your application correctly, you are spinning off
>>> multiple threads, each attempting to comm_spawn a single process -
>>> true? The problem with that design is that - since OMPI is not
>>> thread safe yet - these threads are all attempting to connect to
>>> the MOM at the same time. The MOM will only allow one connection
>>> at a time, and so at some point we are requesting a connection
>>> while already connected.
>>>
>>> Since we are some ways off from attaining thread safety in these
>>> scenarios, you really have three choices:
>>>
>>> 1. you could do this with a single comm_spawn call. Remember, you
>>> can provide an MPI_Info key to comm_spawn essentially telling it
>>> where to place the various process ranks. Unless you truly want
>>> each new process to be in its own comm_world, there is no real
>>> need to do this with 10000 individual calls to comm_spawn.
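>>>
>>> As a sketch of what I mean (untested; the hostnames are just
>>> examples, and the exact list syntax the "host" key accepts may
>>> vary):
>>>
>>>   MPI_Comm children;
>>>   MPI_Info info;
>>>   int errcodes[4];
>>>
>>>   MPI_Info_create(&info);
>>>   /* "host" is a reserved spawn info key; it says where the
>>>      four ranks should be placed */
>>>   MPI_Info_set(info, "host", "cluster1,cluster2,cluster3,cluster4");
>>>   MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 4, info,
>>>                  0, MPI_COMM_SELF, &children, errcodes);
>>>   MPI_Info_free(&info);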
>> I only need master-to-slave communication; the slaves *don't* need
>> to communicate with each other. The logic within the test program
>> is quite simple: it dispatches as many jobs as the user requests
>> across the assigned nodes, trying to keep them as busy as possible.
>> That's because our algorithms need a tree evolution where a node is
>> master of a bunch of slaves, and a slave can in turn be a
>> sub-master of a bunch of slaves, depending on how each leaf evolves
>> in its computation. Generally we don't go more than 5 or 6 levels
>> deep, but we need a very dynamic logic for dispatching jobs.
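>>
>> (In MPI terms the idea is simply that a spawned slave can itself
>> call MPI_Comm_spawn to create its own sub-slaves; a rough sketch,
>> with a placeholder fan-out:)
>>
>>   int nslaves = 4;                      /* placeholder fan-out */
>>   MPI_Comm parent, subchildren;
>>
>>   MPI_Comm_get_parent(&parent);
>>   if (parent != MPI_COMM_NULL) {
>>       /* this slave becomes a sub-master of its own bunch */
>>       MPI_Comm_spawn("testslave", MPI_ARGV_NULL, nslaves,
>>                      MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>                      &subchildren, MPI_ERRCODES_IGNORE);
>>   }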
>>> 2. you could execute your own thread locking scheme in your
>>> application so that only one thread calls comm_spawn at a time.
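>>>
>>> Something along these lines would do it (a sketch only; assumes
>>> pthreads, omits error handling, and the "host" value is whatever
>>> node you are targeting):
>>>
>>>   #include <pthread.h>
>>>   #include <mpi.h>
>>>
>>>   static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;
>>>
>>>   /* every worker thread funnels its spawn through one lock, so
>>>      only one thread talks to the launcher at a time */
>>>   void spawn_one_slave(char *host, MPI_Comm *child)
>>>   {
>>>       MPI_Info info;
>>>       int err[1];
>>>
>>>       MPI_Info_create(&info);
>>>       MPI_Info_set(info, "host", host);
>>>       pthread_mutex_lock(&spawn_lock);
>>>       MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, info,
>>>                      0, MPI_COMM_SELF, child, err);
>>>       pthread_mutex_unlock(&spawn_lock);
>>>       MPI_Info_free(&info);
>>>   }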
>> I did it both with and without _tm_ support, using a mutex to
>> serialize MPI_Comm_spawn().
>> The log below is with the torque/pbs support compiled in:
>>
>> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir
>> "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> add_local_procs
>> [master.tekno-soft.it:07844] [[10231,0],0] node[0].name master
>> daemon 0
>> arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[1].name cluster4
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[2].name cluster3
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[3].name cluster2
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[4].name cluster1
>> daemon
>> INVALID arch ffc91200
>> Initializing MPI ...
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_recv: received
>> sync+nidmap from local proc [[10231,1],0]
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> message_local_procs
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> message_local_procs
>> Loading the node's ring from file
>> '/var/torque/aux//929.master.tekno-soft.it'
>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>> A 4 node's ring has been made
>> At least one node is available, let's start to distribute 100000 job
>> across 4 nodes!!!
>> ****************** Starting job #1
>> ****************** Starting job #2
>> ****************** Starting job #3
>> ****************** Starting job #4
>> Setting up the host as 'cluster4.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>> Setting up the host as 'cluster3.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster3.tekno-soft.it'
>> Setting up the host as 'cluster2.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster2.tekno-soft.it'
>> Setting up the host as 'cluster1.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster1.tekno-soft.it'
>> Daemon was launched on cluster4.tekno-soft.it - beginning to
>> initialize
>> Daemon [[10231,0],1] checking in as pid 4869 on host cluster4.tekno-
>> soft.it
>> Daemon [[10231,0],1] not using static ports
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted: up and running -
>> waiting for commands!
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> add_local_procs
>> [master.tekno-soft.it:07844] [[10231,0],0] node[0].name master
>> daemon 0
>> arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[1].name cluster4
>> daemon
>> 1 arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[2].name cluster3
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[3].name cluster2
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:07844] [[10231,0],0] node[4].name cluster1
>> daemon
>> INVALID arch ffc91200
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
>> add_local_procs
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[0].name master
>> daemon
>> 0 arch ffc91200
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[1].name cluster4
>> daemon 1 arch ffc91200
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[2].name cluster3
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[3].name cluster2
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[4].name cluster1
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_recv: received
>> sync+nidmap from local proc [[10231,2],0]
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
>> message_local_procs
>> Killed
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] routed:binomial:
>> Connection
>> to lifeline [[10231,0],0] lost
>> [cluster4.tekno-soft.it:04869] [[10231,0],1] routed:binomial:
>> Connection
>> to lifeline [[10231,0],0] lost
>> [roberto_at_master TestOpenMPI]$
>>
>> This one is *without* tm:
>>
>> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir
>> "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> add_local_procs
>> [master.tekno-soft.it:25143] [[23396,0],0] node[0].name master
>> daemon 0
>> arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[1].name cluster4
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[2].name cluster3
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[3].name cluster2
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[4].name cluster1
>> daemon
>> INVALID arch ffc91200
>> Initializing MPI ...
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_recv: received
>> sync+nidmap from local proc [[23396,1],0]
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> message_local_procs
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> message_local_procs
>> Loading the node's ring from file
>> '/var/torque/aux//928.master.tekno-soft.it'
>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>> A 4 node's ring has been made
>> At least one node is available, let's start to distribute 100000 job
>> across 4 nodes!!!
>> ****************** Starting job #1
>> ****************** Starting job #2
>> ****************** Starting job #3
>> ****************** Starting job #4
>> Setting up the host as 'cluster4.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>> Setting up the host as 'cluster3.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster3.tekno-soft.it'
>> Setting up the host as 'cluster2.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster2.tekno-soft.it'
>> Setting up the host as 'cluster1.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster1.tekno-soft.it'
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> add_local_procs
>> [master.tekno-soft.it:25143] [[23396,0],0] node[0].name master
>> daemon 0
>> arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[1].name cluster4
>> daemon
>> 1 arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[2].name cluster3
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[3].name cluster2
>> daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:25143] [[23396,0],0] node[4].name cluster1
>> daemon
>> INVALID arch ffc91200
>> Daemon was launched on cluster4.tekno-soft.it - beginning to
>> initialize
>> Daemon [[23396,0],1] checking in as pid 3653 on host cluster4.tekno-
>> soft.it
>> Daemon [[23396,0],1] not using static ports
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted: up and running -
>> waiting for commands!
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
>> add_local_procs
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[0].name master
>> daemon
>> 0 arch ffc91200
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[1].name cluster4
>> daemon 1 arch ffc91200
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[2].name cluster3
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[3].name cluster2
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[4].name cluster1
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_recv: received
>> sync+nidmap from local proc [[23396,2],0]
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
>> message_local_procs
>>
>> [... got a freeze here ... then ^C ...]
>>
>> mpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 25150 on node
>> master.tekno-soft.it exited on signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received exit
>> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted: finalizing
>> mpirun: clean termination accomplished
>>
>> [cluster4:03653] *** Process received signal ***
>> [cluster4:03653] Signal: Segmentation fault (11)
>> [cluster4:03653] Signal code: Address not mapped (1)
>> [cluster4:03653] Failing at address: 0x2aaaab784af0
>>
>> So it seems that we have problems in other places too; maybe some
>> other functions are not thread safe.
>>
>>> 3. remove the threaded launch scenario and just call comm_spawn
>>> in a loop.
>>>
>>> In truth, the threaded approach to spawning all these procs isn't
>>> gaining you anything. Torque will only do one launch at a time
>>> anyway, so you will launch them serially no matter what. You may
>>> just be adding complexity for no real net gain.
>> Talking about torque/pbs/maui, that's ok! It doesn't handle
>> multiple spawns at the same time.
>> But in general, if I no longer use _tm_, I guess we can gain from
>> executing spawns in parallel, because the spawning will be done
>> using ssh/rsh.
>>>
>>> Ralph
>>>
>>> On Oct 1, 2008, at 1:56 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain ha scritto:
>>>>> Hi Roberto
>>>>>
>>>>> There is something wrong with this cmd line - perhaps it wasn't
>>>>> copied correctly?
>>>>>
>>>>> mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd`
>>>>> testmaster 10000 $PBS_NODEFILE
>>>>>
>>>>> Specifically, the following is incomplete: --mca obl
>>>>>
>>>>> I'm not sure if this is the problem or not, but I am unaware of
>>>>> such an option and believe it could cause mpirun to become
>>>>> confused.
>>>> Oops! Sorry, I copied the wrong log; below is the right one:
>>>>
>>>> [roberto_at_master TestOpenMPI]$ qsub -I testmaster.pbs
>>>> qsub: waiting for job 920.master.tekno-soft.it to start
>>>> qsub: job 920.master.tekno-soft.it ready
>>>>
>>>> [roberto_at_master TestMPICH2]$ cd /data/roberto/MPI/TestOpenMPI/
>>>> [roberto_at_master TestOpenMPI]$ mpirun --debug-daemons --mca btl
>>>> tcp,self
>>>> -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>>> add_local_procs
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[0].name master
>>>> daemon 0
>>>> arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[1].name cluster4
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[2].name cluster3
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[3].name cluster2
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[4].name cluster1
>>>> daemon
>>>> INVALID arch ffc91200
>>>> Initializing MPI ...
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_recv: received
>>>> sync+nidmap from local proc [[11340,1],0]
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>>> message_local_procs
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>>> collective data cmd
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>>> message_local_procs
>>>> Loading the node's ring from file
>>>> '/var/torque/aux//920.master.tekno-soft.it'
>>>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>>>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>>>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>>>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>>>> A 4 node's ring has been made
>>>> At least one node is available, let's start to distribute 100000
>>>> job
>>>> across 4 nodes!!!
>>>> ****************** Starting job #1
>>>> ****************** Starting job #2
>>>> ****************** Starting job #3
>>>> Setting up the host as 'cluster4.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave' on node 'cluster4.tekno-soft.it'
>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave' on node 'cluster3.tekno-soft.it'
>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave' on node 'cluster2.tekno-soft.it'
>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>> Spawning a task 'testslave' on node 'cluster1.tekno-soft.it'
>>>> ****************** Starting job #4
>>>> Daemon was launched on cluster3.tekno-soft.it - beginning to
>>>> initialize
>>>> Daemon [[11340,0],1] checking in as pid 9487 on host
>>>> cluster3.tekno-soft.it
>>>> Daemon [[11340,0],1] not using static ports
>>>> --------------------------------------------------------------------------
>>>>
>>>> A daemon (pid unknown) died unexpectedly on signal 1 while
>>>> attempting to
>>>> launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see
>>>> above).
>>>>
>>>> This may be because the daemon was unable to find all the needed
>>>> shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>> have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>>
>>>> [master.tekno-soft.it:05407] [[11340,0],0] ORTE_ERROR_LOG:
>>>> Resource busy
>>>> in file base/plm_base_receive.c at line 169
>>>> [master.tekno-soft.it:05414] [[11340,1],0] ORTE_ERROR_LOG: The
>>>> specified
>>>> application failed to start in file dpm_orte.c at line 677
>>>> [master.tekno-soft.it:05414] *** An error occurred in
>>>> MPI_Comm_spawn
>>>> [master.tekno-soft.it:05414] *** on communicator MPI_COMM_WORLD
>>>> [master.tekno-soft.it:05414] *** MPI_ERR_SPAWN: could not spawn
>>>> processes
>>>> [master.tekno-soft.it:05414] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>>
>>>> [master.tekno-soft.it:05407] [[11340,0],0] ORTE_ERROR_LOG:
>>>> Resource busy
>>>> in file base/plm_base_receive.c at line 169
>>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>>> add_local_procs
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[0].name master
>>>> daemon 0
>>>> arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[1].name cluster4
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[2].name cluster3
>>>> daemon
>>>> 1 arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[3].name cluster2
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [master.tekno-soft.it:05407] [[11340,0],0] node[4].name cluster1
>>>> daemon
>>>> INVALID arch ffc91200
>>>> [cluster3.tekno-soft.it:09487] [[11340,0],1] orted: up and
>>>> running -
>>>> waiting for commands!
>>>>
>>>>
>>>>
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Sep 30, 2008, at 8:24 AM, Roberto Fichera wrote:
>>>>>
>>>>>> Roberto Fichera ha scritto:
>>>>>>> Hi All on the list,
>>>>>>>
>>>>>>> I'm trying to execute dynamic MPI applications using
>>>>>>> MPI_Comm_spawn(). The application I'm using for tests is
>>>>>>> basically composed of a master which spawns a slave on each
>>>>>>> assigned node in a multithreaded fashion. The master is
>>>>>>> started with a number of jobs to perform and a filename
>>>>>>> containing the list of assigned nodes. The idea is to handle
>>>>>>> all the dispatching logic within the application, so that the
>>>>>>> master will try to keep each assigned node as busy as
>>>>>>> possible. That said, for each spawned job the master allocates
>>>>>>> a thread for spawning and handling the communication, then
>>>>>>> generates a random number and sends it to the slave, which
>>>>>>> simply sends it back to the master. Finally the slave
>>>>>>> terminates its job and the corresponding node becomes free for
>>>>>>> a new one. This continues until all the requested jobs are
>>>>>>> done.
>>>>>>>
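>>>>>>> The slave side is essentially just this (a sketch of the
>>>>>>> idea; the real program also sets up its environment):
>>>>>>>
>>>>>>>   #include <mpi.h>
>>>>>>>
>>>>>>>   int main(int argc, char **argv)
>>>>>>>   {
>>>>>>>       MPI_Comm parent;
>>>>>>>       int value;
>>>>>>>
>>>>>>>       MPI_Init(&argc, &argv);
>>>>>>>       /* a spawned slave reaches its master over the parent
>>>>>>>          intercommunicator */
>>>>>>>       MPI_Comm_get_parent(&parent);
>>>>>>>       MPI_Recv(&value, 1, MPI_INT, 0, 0, parent,
>>>>>>>                MPI_STATUS_IGNORE);
>>>>>>>       MPI_Send(&value, 1, MPI_INT, 0, 0, parent); /* echo */
>>>>>>>       MPI_Comm_disconnect(&parent);
>>>>>>>       MPI_Finalize();
>>>>>>>       return 0;
>>>>>>>   }
>>>>>>>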
>>>>>>> The test program I'm using *doesn't* work flawlessly in mpich2
>>>>>>> because of a ~24k spawned-job limitation: its internal context
>>>>>>> id increases monotonically, which eventually stops the
>>>>>>> application with an internal library overflow. The internal
>>>>>>> context ids allocated for spawned jobs are, at the moment,
>>>>>>> never recycled after the jobs terminate. The only MPI-2
>>>>>>> implementation (that is, one supporting MPI_Comm_spawn()) that
>>>>>>> has been able to complete the test so far is HP MPI. So now I
>>>>>>> would like to check whether OpenMPI is suitable for our
>>>>>>> dynamic parallel applications.
>>>>>>>
>>>>>>> The test application is linked against OpenMPI v1.3a1r19645,
>>>>>>> running on Fedora 8 x86_64 + all updates.
>>>>>>>
>>>>>>> My first attempt ended up with the error below, and I
>>>>>>> basically don't know where to look further. Note that I've
>>>>>>> already checked the PATHs and LD_LIBRARY_PATH; the application
>>>>>>> is configured correctly, since it is started via two scripts
>>>>>>> where all the paths are set. Basically I need to start *one*
>>>>>>> master application which will handle everything needed to
>>>>>>> manage the slave applications. The communication is *only*
>>>>>>> master <-> slave and never collective, at the moment.
>>>>>>>
>>>>>>> The test program is available on request.
>>>>>>>
>>>>>>> Does any one have an idea what's going on?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Roberto Fichera.
>>>>>>>
>>>>>>> [roberto_at_cluster4 TestOpenMPI]$ orterun -wdir
>>>>>>> /data/roberto/MPI/TestOpenMPI -np
>>>>>>> 1 testmaster 10000 $PBS_NODEFILE
>>>>>>> Initializing MPI ...
>>>>>>> Loading the node's ring from file
>>>>>>> '/var/torque/aux//909.master.tekno-soft.it'
>>>>>>> ... adding node #1 host is 'cluster3.tekno-soft.it'
>>>>>>> ... adding node #2 host is 'cluster2.tekno-soft.it'
>>>>>>> ... adding node #3 host is 'cluster1.tekno-soft.it'
>>>>>>> ... adding node #4 host is 'master.tekno-soft.it'
>>>>>>> A 4 node's ring has been made
>>>>>>> At least one node is available, let's start to distribute
>>>>>>> 10000 job
>>>>>>> across 4
>>>>>>> nodes!!!
>>>>>>> ****************** Starting job #1
>>>>>>> ****************** Starting job #2
>>>>>>> ****************** Starting job #3
>>>>>>> ****************** Starting job #4
>>>>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>> Spawning a task './testslave.sh' on node 'cluster3.tekno-
>>>>>>> soft.it'
>>>>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>> Spawning a task './testslave.sh' on node 'cluster2.tekno-
>>>>>>> soft.it'
>>>>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>> Spawning a task './testslave.sh' on node 'cluster1.tekno-
>>>>>>> soft.it'
>>>>>>> Setting up the host as 'master.tekno-soft.it'
>>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>>> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while
>>>>>>> attempting to
>>>>>>> launch so we are aborting.
>>>>>>>
>>>>>>> There may be more information reported by the environment (see
>>>>>>> above).
>>>>>>>
>>>>>>> This may be because the daemon was unable to find all the needed
>>>>>>> shared
>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>>>>> to
>>>>>>> have the
>>>>>>> location of the shared libraries on the remote nodes and this
>>>>>>> will
>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG:
>>>>>>> Resource busy in
>>>>>>> file base/plm_base_receive.c at line 169
>>>>>>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG:
>>>>>>> Resource busy in
>>>>>>> file base/plm_base_receive.c at line 169
>>>>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users