
Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-01 12:43:47


Afraid I am somewhat at a loss. The logs indicate that mpirun itself
is having problems, likely caused by the threading. Only thing I can
suggest is that you "unthread" the spawning loop and try it that way
first so we can see if some underlying problem exists.

FWIW: I have run a loop over calls to comm_spawn without problems.
However, there are system limits to the number of child processes an
orted can create. You may hit those at some point - we try to report
that as a separate error when we see it, but it isn't always easy to
catch.
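
For illustration only (this is not code from the thread; the slave name and
job count are placeholders), here is a minimal C sketch of that "loop over
comm_spawn" pattern, with the error handler set to return errors so that
hitting a spawn limit shows up in the caller instead of aborting:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Return errors instead of aborting so a failed spawn can be reported. */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    for (int job = 0; job < 100; job++) {           /* placeholder job count */
        MPI_Comm child;
        int errcode;
        int rc = MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                                0, MPI_COMM_SELF, &child, &errcode);
        if (rc != MPI_SUCCESS || errcode != MPI_SUCCESS) {
            fprintf(stderr, "spawn of job %d failed (rc=%d, errcode=%d)\n",
                    job, rc, errcode);
            break;
        }
        /* ... exchange data with the slave over 'child' ... */
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}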

Like I said, we really don't support threaded operations like this
right now, so I have no idea what your app may be triggering. I would
definitely try it "unthreaded" if possible.

Ralph

On Oct 1, 2008, at 9:04 AM, Roberto Fichera wrote:

> Ralph Castain wrote:
>> Okay, I believe I understand the problem. What this error is telling
>> you is that the Torque MOM is refusing our connection request because
>> it is already busy. So we cannot spawn another process.
>>
>> If I understand your application correctly, you are spinning off
>> multiple threads, each attempting to comm_spawn a single process -
>> true? The problem with that design is that - since OMPI is not thread
>> safe yet - these threads are all attempting to connect to the MOM at
>> the same time. The MOM will only allow one connection at a time, and
>> so at some point we are requesting a connection while already
>> connected.
>>
>> Since we are some ways off from attaining thread safety in these
>> scenarios, you really have three choices:
>>
>> 1. you could do this with a single comm_spawn call. Remember, you can
>> provide an MPI_Info key to comm_spawn essentially telling it where to
>> place the various process ranks. Unless you truly want each new
>> process to be in its own comm_world, there is no real need to do this
>> with 10000 individual calls to comm_spawn.
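
As a sketch of that first suggestion (illustrative only, not code from this
thread): a single MPI_Comm_spawn call can launch all the slaves into one
intercommunicator, with MPI_Info keys hinting at placement. The host list and
working directory below are copied from the logs in this thread; exactly which
info keys ("host", "wdir", ...) are honored this way depends on the Open MPI
version.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Placement hint: comma-separated host list (assumed to be accepted). */
    MPI_Info_set(info, "host",
                 "cluster4.tekno-soft.it,cluster3.tekno-soft.it,"
                 "cluster2.tekno-soft.it,cluster1.tekno-soft.it");
    MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

    /* One spawn call creates all four slaves; errcodes[i] reports whether
       child rank i started. */
    MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 4, info,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    MPI_Info_free(&info);
    /* ... talk to each slave by its rank in 'intercomm' ... */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}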
> I only need master-to-slave communication; the slaves *don't* need to
> communicate with each other. The logic of the test program is quite
> simple: it dispatches as many jobs as the user requests across the
> assigned nodes, trying to keep each node as busy as possible. That's
> because our algorithms need a tree evolution in which a node is the
> master of a bunch of slaves, and a slave can in turn be a sub-master
> of its own bunch of slaves, depending on how each leaf evolves in its
> computation. Generally we don't go more than 5 or 6 levels deep, but
> we need a very dynamic logic for dispatching jobs.
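
Purely to illustrate the tree-shaped dispatch Roberto describes (this is not
his test program; the names and depth handling are made up): a slave can get
the intercommunicator back to its parent with MPI_Comm_get_parent() and, acting
as a sub-master, spawn slaves of its own.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, children;
    int depth = (argc > 1) ? atoi(argv[1]) : 0;   /* level in the tree */

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);     /* intercomm back to our master */

    /* ... receive work from the parent, compute, send results back ... */

    if (depth < 5) {                  /* the tree rarely goes past 5-6 levels */
        char next[16];
        char *child_argv[] = { next, NULL };
        snprintf(next, sizeof next, "%d", depth + 1);
        MPI_Comm_spawn("testslave", child_argv, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
        /* ... act as sub-master: dispatch sub-jobs over 'children' ... */
        MPI_Comm_disconnect(&children);
    }

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}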
>> 2. you could execute your own thread locking scheme in your
>> application so that only one thread calls comm_spawn at a time.
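
A minimal sketch of that locking scheme (hypothetical names, not the code under
discussion): each worker thread wraps its MPI_Comm_spawn call in a process-wide
mutex so only one spawn is in flight at a time. MPI would also have to be
initialized with MPI_Init_thread() requesting MPI_THREAD_MULTIPLE for threads
to call MPI at all.

#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body for one job; 'arg' is the target host name. */
static void *spawn_one_slave(void *arg)
{
    const char *host = (const char *)arg;
    MPI_Comm intercomm;
    MPI_Info info;
    int errcode;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)host);

    /* Only one thread may be inside MPI_Comm_spawn at a time. */
    pthread_mutex_lock(&spawn_lock);
    MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, &errcode);
    pthread_mutex_unlock(&spawn_lock);

    MPI_Info_free(&info);
    /* ... exchange data with the slave, then disconnect ... */
    MPI_Comm_disconnect(&intercomm);
    return NULL;
}

Note that even with the mutex, the other MPI calls still run concurrently from
the other threads, so this only serializes the spawn itself.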
> I did it both with and without _tm_ support, using a mutex to
> serialize the MPI_Comm_spawn() calls.
> The log below is with the torque/pbs support compiled in:
>
> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir
> "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> add_local_procs
> [master.tekno-soft.it:07844] [[10231,0],0] node[0].name master
> daemon 0
> arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[1].name cluster4
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[2].name cluster3
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[3].name cluster2
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[4].name cluster1
> daemon
> INVALID arch ffc91200
> Initializing MPI ...
> [master.tekno-soft.it:07844] [[10231,0],0] orted_recv: received
> sync+nidmap from local proc [[10231,1],0]
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> message_local_procs
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> message_local_procs
> Loading the node's ring from file
> '/var/torque/aux//929.master.tekno-soft.it'
> ... adding node #1 host is 'cluster4.tekno-soft.it'
> ... adding node #2 host is 'cluster3.tekno-soft.it'
> ... adding node #3 host is 'cluster2.tekno-soft.it'
> ... adding node #4 host is 'cluster1.tekno-soft.it'
> A 4 node's ring has been made
> At least one node is available, let's start to distribute 100000 job
> across 4 nodes!!!
> ****************** Starting job #1
> ****************** Starting job #2
> ****************** Starting job #3
> ****************** Starting job #4
> Setting up the host as 'cluster4.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
> Setting up the host as 'cluster3.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster3.tekno-soft.it'
> Setting up the host as 'cluster2.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster2.tekno-soft.it'
> Setting up the host as 'cluster1.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster1.tekno-soft.it'
> Daemon was launched on cluster4.tekno-soft.it - beginning to
> initialize
> Daemon [[10231,0],1] checking in as pid 4869 on host cluster4.tekno-
> soft.it
> Daemon [[10231,0],1] not using static ports
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted: up and running -
> waiting for commands!
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> add_local_procs
> [master.tekno-soft.it:07844] [[10231,0],0] node[0].name master
> daemon 0
> arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[1].name cluster4
> daemon
> 1 arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[2].name cluster3
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[3].name cluster2
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:07844] [[10231,0],0] node[4].name cluster1
> daemon
> INVALID arch ffc91200
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
> add_local_procs
> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[0].name master
> daemon
> 0 arch ffc91200
> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[1].name cluster4
> daemon 1 arch ffc91200
> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[2].name cluster3
> daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[3].name cluster2
> daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:04869] [[10231,0],1] node[4].name cluster1
> daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_recv: received
> sync+nidmap from local proc [[10231,2],0]
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> message_local_procs
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
> message_local_procs
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:07844] [[10231,0],0] orted_cmd: received
> message_local_procs
> [cluster4.tekno-soft.it:04869] [[10231,0],1] orted_cmd: received
> message_local_procs
> Killed
> [cluster4.tekno-soft.it:04869] [[10231,0],1] routed:binomial:
> Connection
> to lifeline [[10231,0],0] lost
> [cluster4.tekno-soft.it:04869] [[10231,0],1] routed:binomial:
> Connection
> to lifeline [[10231,0],0] lost
> [roberto_at_master TestOpenMPI]$
>
> This one is *without tm*:
>
> [roberto_at_master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir
> "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> add_local_procs
> [master.tekno-soft.it:25143] [[23396,0],0] node[0].name master
> daemon 0
> arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[1].name cluster4
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[2].name cluster3
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[3].name cluster2
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[4].name cluster1
> daemon
> INVALID arch ffc91200
> Initializing MPI ...
> [master.tekno-soft.it:25143] [[23396,0],0] orted_recv: received
> sync+nidmap from local proc [[23396,1],0]
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> message_local_procs
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> message_local_procs
> Loading the node's ring from file
> '/var/torque/aux//928.master.tekno-soft.it'
> ... adding node #1 host is 'cluster4.tekno-soft.it'
> ... adding node #2 host is 'cluster3.tekno-soft.it'
> ... adding node #3 host is 'cluster2.tekno-soft.it'
> ... adding node #4 host is 'cluster1.tekno-soft.it'
> A 4 node's ring has been made
> At least one node is available, let's start to distribute 100000 job
> across 4 nodes!!!
> ****************** Starting job #1
> ****************** Starting job #2
> ****************** Starting job #3
> ****************** Starting job #4
> Setting up the host as 'cluster4.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
> Setting up the host as 'cluster3.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster3.tekno-soft.it'
> Setting up the host as 'cluster2.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster2.tekno-soft.it'
> Setting up the host as 'cluster1.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task 'testslave.sh' on node 'cluster1.tekno-soft.it'
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> add_local_procs
> [master.tekno-soft.it:25143] [[23396,0],0] node[0].name master
> daemon 0
> arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[1].name cluster4
> daemon
> 1 arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[2].name cluster3
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[3].name cluster2
> daemon
> INVALID arch ffc91200
> [master.tekno-soft.it:25143] [[23396,0],0] node[4].name cluster1
> daemon
> INVALID arch ffc91200
> Daemon was launched on cluster4.tekno-soft.it - beginning to
> initialize
> Daemon [[23396,0],1] checking in as pid 3653 on host cluster4.tekno-
> soft.it
> Daemon [[23396,0],1] not using static ports
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted: up and running -
> waiting for commands!
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
> add_local_procs
> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[0].name master
> daemon
> 0 arch ffc91200
> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[1].name cluster4
> daemon 1 arch ffc91200
> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[2].name cluster3
> daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[3].name cluster2
> daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:03653] [[23396,0],1] node[4].name cluster1
> daemon INVALID arch ffc91200
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_recv: received
> sync+nidmap from local proc [[23396,2],0]
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> message_local_procs
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
> message_local_procs
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> collective data cmd
> [master.tekno-soft.it:25143] [[23396,0],0] orted_cmd: received
> message_local_procs
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received
> message_local_procs
>
> [... got a freeze here ... then ^C ...]
>
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 25150 on node
> master.tekno-soft.it exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted_cmd: received exit
> [cluster4.tekno-soft.it:03653] [[23396,0],1] orted: finalizing
> mpirun: clean termination accomplished
>
> [cluster4:03653] *** Process received signal ***
> [cluster4:03653] Signal: Segmentation fault (11)
> [cluster4:03653] Signal code: Address not mapped (1)
> [cluster4:03653] Failing at address: 0x2aaaab784af0
>
> So it seems we have problems in other places as well; maybe some
> other functions are not thread safe.
>
>> 3. remove the threaded launch scenario and just call comm_spawn in a
>> loop.
>>
>> In truth, the threaded approach to spawning all these procs isn't
>> gaining you anything. Torque will only do one launch at a time
>> anyway,
>> so you will launch them serially no matter what. You may just be
>> adding complexity for no real net gain.
> Regarding torque/pbs/maui, that's fine! It doesn't handle multiple
> spawns at the same time.
> But in general, if I stop using _tm_, I guess we could gain something
> by executing the spawns in parallel, because the spawning would be
> done via ssh/rsh.
>>
>> Ralph
>>
>> On Oct 1, 2008, at 1:56 AM, Roberto Fichera wrote:
>>
>>> Ralph Castain wrote:
>>>> Hi Roberto
>>>>
>>>> There is something wrong with this cmd line - perhaps it wasn't
>>>> copied
>>>> correctly?
>>>>
>>>> mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd`
>>>> testmaster 10000 $PBS_NODEFILE
>>>>
>>>> Specifically, the following is incomplete: --mca obl
>>>>
>>>> I'm not sure if this is the problem or not, but I am unaware of
>>>> such
>>>> an option and believe it could cause mpirun to become confused.
>>> Oops! Sorry, I copied the wrong log; the right one is below:
>>>
>>> [roberto_at_master TestOpenMPI]$ qsub -I testmaster.pbs
>>> qsub: waiting for job 920.master.tekno-soft.it to start
>>> qsub: job 920.master.tekno-soft.it ready
>>>
>>> [roberto_at_master TestMPICH2]$ cd /data/roberto/MPI/TestOpenMPI/
>>> [roberto_at_master TestOpenMPI]$ mpirun --debug-daemons --mca btl
>>> tcp,self
>>> -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>> add_local_procs
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[0].name master
>>> daemon 0
>>> arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[1].name cluster4
>>> daemon
>>> INVALID arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[2].name cluster3
>>> daemon
>>> INVALID arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[3].name cluster2
>>> daemon
>>> INVALID arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[4].name cluster1
>>> daemon
>>> INVALID arch ffc91200
>>> Initializing MPI ...
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_recv: received
>>> sync+nidmap from local proc [[11340,1],0]
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>> collective data cmd
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>> message_local_procs
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>> collective data cmd
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>> message_local_procs
>>> Loading the node's ring from file
>>> '/var/torque/aux//920.master.tekno-soft.it'
>>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>>> A 4 node's ring has been made
>>> At least one node is available, let's start to distribute 100000 job
>>> across 4 nodes!!!
>>> ****************** Starting job #1
>>> ****************** Starting job #2
>>> ****************** Starting job #3
>>> Setting up the host as 'cluster4.tekno-soft.it'
>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>> Spawning a task 'testslave' on node 'cluster4.tekno-soft.it'
>>> Setting up the host as 'cluster3.tekno-soft.it'
>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>> Spawning a task 'testslave' on node 'cluster3.tekno-soft.it'
>>> Setting up the host as 'cluster2.tekno-soft.it'
>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>> Spawning a task 'testslave' on node 'cluster2.tekno-soft.it'
>>> Setting up the host as 'cluster1.tekno-soft.it'
>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>> Spawning a task 'testslave' on node 'cluster1.tekno-soft.it'
>>> ****************** Starting job #4
>>> Daemon was launched on cluster3.tekno-soft.it - beginning to
>>> initialize
>>> Daemon [[11340,0],1] checking in as pid 9487 on host
>>> cluster3.tekno-soft.it
>>> Daemon [[11340,0],1] not using static ports
>>> --------------------------------------------------------------------------
>>>
>>> A daemon (pid unknown) died unexpectedly on signal 1 while
>>> attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see
>>> above).
>>>
>>> This may be because the daemon was unable to find all the needed
>>> shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>> have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>>
>>> [master.tekno-soft.it:05407] [[11340,0],0] ORTE_ERROR_LOG:
>>> Resource busy
>>> in file base/plm_base_receive.c at line 169
>>> [master.tekno-soft.it:05414] [[11340,1],0] ORTE_ERROR_LOG: The
>>> specified
>>> application failed to start in file dpm_orte.c at line 677
>>> [master.tekno-soft.it:05414] *** An error occurred in MPI_Comm_spawn
>>> [master.tekno-soft.it:05414] *** on communicator MPI_COMM_WORLD
>>> [master.tekno-soft.it:05414] *** MPI_ERR_SPAWN: could not spawn
>>> processes
>>> [master.tekno-soft.it:05414] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>
>>> [master.tekno-soft.it:05407] [[11340,0],0] ORTE_ERROR_LOG:
>>> Resource busy
>>> in file base/plm_base_receive.c at line 169
>>> [master.tekno-soft.it:05407] [[11340,0],0] orted_cmd: received
>>> add_local_procs
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[0].name master
>>> daemon 0
>>> arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[1].name cluster4
>>> daemon
>>> INVALID arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[2].name cluster3
>>> daemon
>>> 1 arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[3].name cluster2
>>> daemon
>>> INVALID arch ffc91200
>>> [master.tekno-soft.it:05407] [[11340,0],0] node[4].name cluster1
>>> daemon
>>> INVALID arch ffc91200
>>> [cluster3.tekno-soft.it:09487] [[11340,0],1] orted: up and running -
>>> waiting for commands!
>>>
>>>
>>>
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On Sep 30, 2008, at 8:24 AM, Roberto Fichera wrote:
>>>>
>>>>> Roberto Fichera wrote:
>>>>>> Hi All on the list,
>>>>>>
>>>>>> I'm trying to run dynamic MPI applications using
>>>>>> MPI_Comm_spawn().
>>>>>> The application I'm using for tests is basically composed of a
>>>>>> master, which spawns a slave on each assigned node in a
>>>>>> multithreaded fashion. The master is started with a number of
>>>>>> jobs to perform and a filename containing the list of assigned
>>>>>> nodes. The idea is to handle all the dispatching logic within
>>>>>> the application, so that the master keeps each assigned node as
>>>>>> busy as possible. For each spawned job, the master allocates a
>>>>>> thread for spawning and handling the communication, then
>>>>>> generates a random number and sends it to the slave, which
>>>>>> simply sends it back to the master. Finally the slave terminates
>>>>>> its job and the corresponding node becomes free for a new one.
>>>>>> This continues until all the requested jobs are done.
>>>>>>
>>>>>> The test program I'm using *doesn't* work flawlessly in mpich2,
>>>>>> because it hits a limit of roughly 24k spawned jobs: the
>>>>>> internal context id grows monotonically and eventually overflows
>>>>>> inside the library, which stops the application. The internal
>>>>>> context ids allocated for terminated spawned jobs are never
>>>>>> recycled at the moment. The only MPI-2 implementation (i.e. one
>>>>>> supporting MPI_Comm_spawn()) that has been able to complete the
>>>>>> test so far is HP MPI. So now I would like to check whether
>>>>>> OpenMPI is suitable for our dynamic parallel applications.
>>>>>>
>>>>>> The test application is linked against OpenMPI v1.3a1r19645,
>>>>>> running on Fedora 8 x86_64 + all updates.
>>>>>>
>>>>>> My first attempt ends with the error below, and I basically
>>>>>> don't know where to look further. Note that I've already checked
>>>>>> PATHs and LD_LIBRARY_PATH; the application is configured
>>>>>> correctly, since it is started via two scripts and all the paths
>>>>>> are set there. Basically I need to start *one* master
>>>>>> application that handles everything related to managing the
>>>>>> slave applications. The communication is *only* master <-> slave
>>>>>> and never collective, at the moment.
>>>>>>
>>>>>> The test program is available on request.
>>>>>>
>>>>>> Does any one have an idea what's going on?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Roberto Fichera.
>>>>>>
>>>>>> [roberto_at_cluster4 TestOpenMPI]$ orterun -wdir
>>>>>> /data/roberto/MPI/TestOpenMPI -np
>>>>>> 1 testmaster 10000 $PBS_NODEFILE
>>>>>> Initializing MPI ...
>>>>>> Loading the node's ring from file
>>>>>> '/var/torque/aux//909.master.tekno-soft.it'
>>>>>> ... adding node #1 host is 'cluster3.tekno-soft.it'
>>>>>> ... adding node #2 host is 'cluster2.tekno-soft.it'
>>>>>> ... adding node #3 host is 'cluster1.tekno-soft.it'
>>>>>> ... adding node #4 host is 'master.tekno-soft.it'
>>>>>> A 4 node's ring has been made
>>>>>> At least one node is available, let's start to distribute 10000
>>>>>> job
>>>>>> across 4
>>>>>> nodes!!!
>>>>>> ****************** Starting job #1
>>>>>> ****************** Starting job #2
>>>>>> ****************** Starting job #3
>>>>>> ****************** Starting job #4
>>>>>> Setting up the host as 'cluster3.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
>>>>>> Setting up the host as 'cluster2.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
>>>>>> Setting up the host as 'cluster1.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
>>>>>> Setting up the host as 'master.tekno-soft.it'
>>>>>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>>>>>> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while
>>>>>> attempting to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see
>>>>>> above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed
>>>>>> shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>>>> have the
>>>>>> location of the shared libraries on the remote nodes and this
>>>>>> will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG:
>>>>>> Resource busy in
>>>>>> file base/plm_base_receive.c at line 169
>>>>>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG:
>>>>>> Resource busy in
>>>>>> file base/plm_base_receive.c at line 169
>>>>>>