Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
From: Roberto Fichera (kernel_at_[hidden])
Date: 2008-09-30 10:24:09


Roberto Fichera wrote:
> Hi all on the list,
>
> I'm trying to run dynamic MPI applications using MPI_Comm_spawn().
> The test application consists of a master which spawns a slave on each
> assigned node, in a multithreaded fashion. The master is started with
> the number of jobs to perform and a filename containing the list of
> assigned nodes. The idea is to handle all the dispatching logic within
> the application, so that the master keeps each assigned node as busy
> as possible. For each job, the master allocates a thread that spawns
> the slave and handles the communication: it generates a random number
> and sends it to the slave, which simply sends it back. The slave then
> terminates its job, and its node becomes free for a new one. This
> continues until all the requested jobs are done.
>
> The test program does *not* work flawlessly under MPICH2: it hits a
> limit of about 24k spawned jobs, because MPICH2's internal context IDs
> grow monotonically and the IDs allocated for terminated spawned jobs
> are never recycled at the moment, so the application eventually stops
> on an internal library overflow. The only MPI-2 implementation (i.e.
> one supporting MPI_Comm_spawn()) that has been able to complete the
> test so far is HP MPI. So now I would like to check whether Open MPI
> is suitable for our dynamic parallel applications.
>
> The test application is linked against Open MPI v1.3a1r19645, running
> on Fedora 8 x86_64 + all updates.
>
> My first attempt ended with the error below, and I basically don't
> know where to look next. Note that I've already checked PATH and
> LD_LIBRARY_PATH; the application is configured correctly, since it is
> started via two scripts in which all the paths are set. Basically I
> need to start *one* master application that handles everything needed
> to manage the slave applications. The communication is *only*
> master <-> slave, never collective, at the moment.
>
> The test program is available on request.
>
> Does any one have an idea what's going on?
>
> Thanks in advance,
> Roberto Fichera.
>
> [roberto_at_cluster4 TestOpenMPI]$ orterun -wdir /data/roberto/MPI/TestOpenMPI -np
> 1 testmaster 10000 $PBS_NODEFILE
> Initializing MPI ...
> Loading the node's ring from file '/var/torque/aux//909.master.tekno-soft.it'
> ... adding node #1 host is 'cluster3.tekno-soft.it'
> ... adding node #2 host is 'cluster2.tekno-soft.it'
> ... adding node #3 host is 'cluster1.tekno-soft.it'
> ... adding node #4 host is 'master.tekno-soft.it'
> A 4 node's ring has been made
> At least one node is available, let's start to distribute 10000 job across 4
> nodes!!!
> ****************** Starting job #1
> ****************** Starting job #2
> ****************** Starting job #3
> ****************** Starting job #4
> Setting up the host as 'cluster3.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
> Setting up the host as 'cluster2.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
> Setting up the host as 'cluster1.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
> Setting up the host as 'master.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in
> file base/plm_base_receive.c at line 169
> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in
> file base/plm_base_receive.c at line 169
>
Just to say that I've made a little progress: now everything seems to
start, but mpirun doesn't find the executable.

[roberto_at_cluster4 TestOpenMPI]$ mpirun --verbose --debug-daemons --mca
obl -np 1 -wdir `pwd` testmaster 10000 $PBS_NODEFILE
Daemon was launched on cluster3.tekno-soft.it - beginning to initialize
Daemon was launched on cluster2.tekno-soft.it - beginning to initialize
Daemon was launched on cluster1.tekno-soft.it - beginning to initialize
Daemon [[14600,0],2] checking in as pid 28732 on host cluster2.tekno-soft.it
Daemon [[14600,0],2] not using static ports
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted: up and running -
waiting for commands!
Daemon [[14600,0],3] checking in as pid 2590 on host cluster1.tekno-soft.it
Daemon [[14600,0],3] not using static ports
[cluster1.tekno-soft.it:02590] [[14600,0],3] orted: up and running -
waiting for commands!
Daemon [[14600,0],1] checking in as pid 6969 on host cluster3.tekno-soft.it
Daemon [[14600,0],1] not using static ports
[cluster3.tekno-soft.it:06969] [[14600,0],1] orted: up and running -
waiting for commands!
Daemon was launched on master.tekno-soft.it - beginning to initialize
Daemon [[14600,0],4] checking in as pid 1113 on host master.tekno-soft.it
Daemon [[14600,0],4] not using static ports
[master.tekno-soft.it:01113] [[14600,0],4] orted: up and running -
waiting for commands!
[cluster4.tekno-soft.it:07953] [[14600,0],0] orted_cmd: received
add_local_procs
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[0].name cluster4
daemon 0 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[1].name cluster3
daemon 1 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[2].name cluster2
daemon 2 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[3].name cluster1
daemon 3 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[4].name master daemon
4 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] orted_cmd: received
add_local_procs
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received
add_local_procs
[master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received
add_local_procs
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[0].name cluster4
daemon 0 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[1].name cluster3
daemon 1 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[2].name cluster2
daemon 2 [cluster2.tekno-soft.it:28732] [[14600,0],2] node[0].name
cluster4 daemon 0 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[1].name cluster3
daemon 1 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[2].name cluster2
daemon 2 [master.tekno-soft.it:01113] [[14600,0],4] node[0].name
cluster4 daemon 0 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[1].name cluster3 daemon
1 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[2].name cluster2 daemon
2 arch farch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[3].name cluster1
daemon 3 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[4].name master daemon
4 arch ffc91200
arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[3].name cluster1
daemon 3 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[4].name master daemon
4 arch ffc91200
fc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[3].name cluster1 daemon
3 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[4].name master daemon 4
arch ffc91200
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not
find an executable:

Executable: 1
Node: cluster4.tekno-soft.it

while attempting to start process rank 0.
--------------------------------------------------------------------------
[master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received exit
[master.tekno-soft.it:01113] [[14600,0],4] orted: finalizing
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received exit
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted: finalizing
[master:01113] *** Process received signal ***
[cluster2:28732] *** Process received signal ***
[cluster2:28732] Signal: Segmentation fault (11)
[cluster2:28732] Signal code: Address not mapped (1)
[cluster2:28732] Failing at address: 0x2aaaab784af0
[master:01113] Signal: Segmentation fault (11)
[master:01113] Signal code: Address not mapped (1)
[master:01113] Failing at address: 0x2aaaab786af0
mpirun: abort is already in progress...hit ctrl-c again to forcibly
terminate

[cluster1.tekno-soft.it:02590] [[14600,0],3] routed:binomial: Connection
to lifeline [[14600,0],0] lost
[roberto_at_cluster4 TestOpenMPI]$
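A likely reason for the "Executable: 1" failure above (my reading of the output, not something confirmed elsewhere in the thread): Open MPI's `--mca` option takes *two* arguments, a parameter name and a value. With only `obl` following `--mca`, mpirun would consume `obl` and `-np` as the name/value pair, leaving `1` to be parsed as the executable to launch. A sketch of the intended shape, with `<param>` and `<value>` as placeholders for the actual MCA setting:

```
# --mca takes a parameter name AND a value; if the value is missing,
# the next flag is swallowed and its argument is taken as the executable.
mpirun --verbose --debug-daemons --mca <param> <value> \
       -np 1 -wdir "$(pwd)" testmaster 10000 $PBS_NODEFILE
```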
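For reference, the per-job spawn logic described in the first mail can be sketched roughly as below. This is my own minimal reconstruction, not the actual test program: the slave binary name `./testslave`, the use of the `host` info key, and the round-trip payload are assumptions, and the code needs an MPI installation to build and run under mpirun.

```c
#include <mpi.h>

/* Minimal sketch of one job: spawn a single slave on a given host,
 * round-trip one number, then tear the connection down.
 * "./testslave" and the "host" info key are assumptions. */
int spawn_one(const char *host)
{
    MPI_Comm inter;
    MPI_Info info;
    int number = 42, echoed = 0;

    /* Pin the spawn to a specific node, as the master does per job. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)host);

    MPI_Comm_spawn("./testslave", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);

    /* Round-trip: master -> slave -> master over the intercommunicator. */
    MPI_Send(&number, 1, MPI_INT, 0, 0, inter);
    MPI_Recv(&echoed, 1, MPI_INT, 0, 0, inter, MPI_STATUS_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&inter);
    return echoed == number;
}
```

Each worker thread would call a routine like this for one job (which, with threads doing MPI calls concurrently, requires initializing with MPI_THREAD_MULTIPLE). Note that MPI_Comm_disconnect() only releases the intercommunicator; whether the implementation then recycles its internal context IDs is exactly the behaviour the test is probing.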

> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users