Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-09-30 16:43:27


Hi Roberto

There is something wrong with this cmd line - perhaps it wasn't copied
correctly?

mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd` testmaster 10000 $PBS_NODEFILE

Specifically, the following is incomplete: --mca obl

I'm not sure if this is the problem or not, but I am unaware of such
an option and believe it could cause mpirun to become confused.
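
To illustrate how that confusion might arise: --mca expects a parameter
name followed by a value, so with only "obl" supplied the next token, -np,
would presumably be taken as the value, leaving "1" as the first free
token. That would match the "Executable: 1" line in the output quoted
below. The shell function here is only a toy model of that reading, not
mpirun's actual parser; first_executable, NAME, VALUE, and the /tmp and
nodefile arguments are placeholders invented for the sketch.

```shell
#!/bin/sh
# Toy model only: this is NOT mpirun's parser. It assumes that --mca
# consumes the next two tokens (name, value), that -np and -wdir consume
# one token each, and that the first remaining token is the executable.
first_executable() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --mca)
                # Skip the option plus its assumed name and value.
                if [ "$#" -ge 3 ]; then shift 3; else shift "$#"; fi ;;
            -np|-wdir)
                # Skip the option plus its single argument.
                if [ "$#" -ge 2 ]; then shift 2; else shift "$#"; fi ;;
            --*)
                # Flags without arguments, e.g. --verbose.
                shift ;;
            *)
                printf '%s\n' "$1"
                return 0 ;;
        esac
    done
    return 1
}

# Command line as posted: "obl" is read as the name and "-np" as the
# value, so "1" is the first free token.
first_executable --verbose --debug-daemons --mca obl -np 1 \
    -wdir /tmp testmaster 10000 nodefile      # prints: 1

# With a complete name/value pair (both placeholders), testmaster is found.
first_executable --verbose --debug-daemons --mca NAME VALUE -np 1 \
    -wdir /tmp testmaster 10000 nodefile      # prints: testmaster
```

If this reading is right, either completing the --mca pair or dropping
the option altogether should let mpirun see testmaster as the executable.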

Ralph

On Sep 30, 2008, at 8:24 AM, Roberto Fichera wrote:

> Roberto Fichera wrote:
>> Hi All on the list,
>>
>> I'm trying to run dynamic MPI applications using MPI_Comm_spawn().
>> The test application consists of a master that spawns a slave on
>> each assigned node, in a multithreaded fashion. The master is started
>> with the number of jobs to perform and a filename containing the list
>> of assigned nodes. The idea is to handle all the dispatching logic
>> within the application, so that the master tries to keep each
>> assigned node as busy as possible. That said, for each spawned job
>> the master allocates a thread for spawning and handling the
>> communication, then generates a random number and sends it to the
>> slave, which simply sends it back to the master. Finally the slave
>> terminates its job and its node becomes free for a new one. This
>> continues until all the requested jobs are done.
>>
>> The test program *doesn't* work flawlessly in mpich2, which has a
>> limit of ~24k spawned jobs: its internal context id increases
>> monotonically, which eventually stops the application with an
>> internal overflow in the library. The context ids allocated for
>> terminated spawned jobs are never recycled at the moment. The only
>> MPI-2 implementation (that is, one supporting MPI_Comm_spawn()) that
>> was able to complete the test is currently HP MPI. So now I would
>> like to check whether OpenMPI is suitable for our dynamic parallel
>> applications.
>>
>> The test application is linked against OpenMPI v1.3a1r19645, running
>> on Fedora8 x86_64 + all updates.
>>
>> My first attempt ends with the error below, and I don't know where
>> to look further. Note that I've already checked PATHs and
>> LD_LIBRARY_PATH; the application is configured correctly, since it
>> uses two scripts for starting and all the paths are set there.
>> Basically I need to start *one* master application which will handle
>> everything needed to manage the slave applications. The communication
>> is *only* master <-> slave and never collective, at the moment.
>>
>> The test program is available on request.
>>
>> Does anyone have an idea what's going on?
>>
>> Thanks in advance,
>> Roberto Fichera.
>>
>> [roberto_at_cluster4 TestOpenMPI]$ orterun -wdir /data/roberto/MPI/TestOpenMPI -np 1 testmaster 10000 $PBS_NODEFILE
>> Initializing MPI ...
>> Loading the node's ring from file '/var/torque/aux//909.master.tekno-soft.it'
>> ... adding node #1 host is 'cluster3.tekno-soft.it'
>> ... adding node #2 host is 'cluster2.tekno-soft.it'
>> ... adding node #3 host is 'cluster1.tekno-soft.it'
>> ... adding node #4 host is 'master.tekno-soft.it'
>> A 4 node's ring has been made
>> At least one node is available, let's start to distribute 10000 job across 4 nodes!!!
>> ****************** Starting job #1
>> ****************** Starting job #2
>> ****************** Starting job #3
>> ****************** Starting job #4
>> Setting up the host as 'cluster3.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
>> Setting up the host as 'cluster2.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
>> Setting up the host as 'cluster1.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
>> Setting up the host as 'master.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
>> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
>>
> Just to say that I've made a little progress: everything now seems to
> start, but mpirun doesn't find the executable.
>
> [roberto_at_cluster4 TestOpenMPI]$ mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd` testmaster 10000 $PBS_NODEFILE
> Daemon was launched on cluster3.tekno-soft.it - beginning to initialize
> Daemon was launched on cluster2.tekno-soft.it - beginning to initialize
> Daemon was launched on cluster1.tekno-soft.it - beginning to initialize
> Daemon [[14600,0],2] checking in as pid 28732 on host cluster2.tekno-soft.it
> Daemon [[14600,0],2] not using static ports
> [cluster2.tekno-soft.it:28732] [[14600,0],2] orted: up and running - waiting for commands!
> Daemon [[14600,0],3] checking in as pid 2590 on host cluster1.tekno-soft.it
> Daemon [[14600,0],3] not using static ports
> [cluster1.tekno-soft.it:02590] [[14600,0],3] orted: up and running - waiting for commands!
> Daemon [[14600,0],1] checking in as pid 6969 on host cluster3.tekno-soft.it
> Daemon [[14600,0],1] not using static ports
> [cluster3.tekno-soft.it:06969] [[14600,0],1] orted: up and running - waiting for commands!
> Daemon was launched on master.tekno-soft.it - beginning to initialize
> Daemon [[14600,0],4] checking in as pid 1113 on host master.tekno-soft.it
> Daemon [[14600,0],4] not using static ports
> [master.tekno-soft.it:01113] [[14600,0],4] orted: up and running - waiting for commands!
> [cluster4.tekno-soft.it:07953] [[14600,0],0] orted_cmd: received add_local_procs
> [cluster4.tekno-soft.it:07953] [[14600,0],0] node[0].name cluster4 daemon 0 arch ffc91200
> [cluster4.tekno-soft.it:07953] [[14600,0],0] node[1].name cluster3 daemon 1 arch ffc91200
> [cluster4.tekno-soft.it:07953] [[14600,0],0] node[2].name cluster2 daemon 2 arch ffc91200
> [cluster4.tekno-soft.it:07953] [[14600,0],0] node[3].name cluster1 daemon 3 arch ffc91200
> [cluster4.tekno-soft.it:07953] [[14600,0],0] node[4].name master daemon 4 arch ffc91200
> [cluster3.tekno-soft.it:06969] [[14600,0],1] orted_cmd: received add_local_procs
> [cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received add_local_procs
> [master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received add_local_procs
> [cluster3.tekno-soft.it:06969] [[14600,0],1] node[0].name cluster4 daemon 0 arch ffc91200
> [cluster3.tekno-soft.it:06969] [[14600,0],1] node[1].name cluster3 daemon 1 arch ffc91200
> [cluster3.tekno-soft.it:06969] [[14600,0],1] node[2].name cluster2
> daemon 2 [cluster2.tekno-soft.it:28732] [[14600,0],2] node[0].name
> cluster4 daemon 0 arch ffc91200
> [cluster2.tekno-soft.it:28732] [[14600,0],2] node[1].name cluster3
> daemon 1 arch ffc91200
> [cluster2.tekno-soft.it:28732] [[14600,0],2] node[2].name cluster2
> daemon 2 [master.tekno-soft.it:01113] [[14600,0],4] node[0].name
> cluster4 daemon 0 arch ffc91200
> [master.tekno-soft.it:01113] [[14600,0],4] node[1].name cluster3
> daemon
> 1 arch ffc91200
> [master.tekno-soft.it:01113] [[14600,0],4] node[2].name cluster2
> daemon
> 2 arch farch ffc91200
> [cluster3.tekno-soft.it:06969] [[14600,0],1] node[3].name cluster1
> daemon 3 arch ffc91200
> [cluster3.tekno-soft.it:06969] [[14600,0],1] node[4].name master
> daemon
> 4 arch ffc91200
> arch ffc91200
> [cluster2.tekno-soft.it:28732] [[14600,0],2] node[3].name cluster1
> daemon 3 arch ffc91200
> [cluster2.tekno-soft.it:28732] [[14600,0],2] node[4].name master
> daemon
> 4 arch ffc91200
> fc91200
> [master.tekno-soft.it:01113] [[14600,0],4] node[3].name cluster1
> daemon
> 3 arch ffc91200
> [master.tekno-soft.it:01113] [[14600,0],4] node[4].name master
> daemon 4
> arch ffc91200
> --------------------------------------------------------------------------
> mpirun was unable to launch the specified application as it could not
> find an executable:
>
> Executable: 1
> Node: cluster4.tekno-soft.it
>
> while attempting to start process rank 0.
> --------------------------------------------------------------------------
> [master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received exit
> [master.tekno-soft.it:01113] [[14600,0],4] orted: finalizing
> [cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received exit
> [cluster2.tekno-soft.it:28732] [[14600,0],2] orted: finalizing
> [master:01113] *** Process received signal ***
> [cluster2:28732] *** Process received signal ***
> [cluster2:28732] Signal: Segmentation fault (11)
> [cluster2:28732] Signal code: Address not mapped (1)
> [cluster2:28732] Failing at address: 0x2aaaab784af0
> [master:01113] Signal: Segmentation fault (11)
> [master:01113] Signal code: Address not mapped (1)
> [master:01113] Failing at address: 0x2aaaab786af0
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> [cluster1.tekno-soft.it:02590] [[14600,0],3] routed:binomial: Connection to lifeline [[14600,0],0] lost
> [roberto_at_cluster4 TestOpenMPI]$
>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users