Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] error performing MPI_Comm_spawn
From: Marcia Cristina Cera (marcia.cristina.cera_at_[hidden])
Date: 2009-12-15 10:35:53


Thank you, Ralph

I will use 1.3.3 for now,
while waiting for a future fix release that breaks this race condition.

márcia

On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Looks to me like it is a race condition, and the timing between 1.3.3 and
> 1.4 is just enough to trip it. I can break the race, but it will have to be
> in a future fix release.
>
> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>
> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>
> Hi,
>
> I intend to develop an application that uses MPI_Comm_spawn to dynamically
> create new MPI tasks (or processes).
> The program is structured like a tree: each node creates 2 new ones until
> a predefined number of levels is reached.
>
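To get a feel for how many spawn requests such a tree sends to the runtime, here is a minimal sketch. It assumes the root ch_rec is level 0 and that every node whose level is below the limit spawns exactly two children; the function name tree_process_count and that numbering are assumptions for illustration, not taken from the attachment.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ch_rec processes in a full binary tree in which the root is
 * level 0 and every node with level < limit spawns two children
 * (assumed numbering). Each process is created by exactly one
 * single-process spawn, so this is also the total number of
 * MPI_Comm_spawn calls, including the one made by start. */
uint64_t tree_process_count(int limit)
{
    assert(limit >= 0 && limit < 63);        /* keep the shift in range */
    return (UINT64_C(1) << (limit + 1)) - 1; /* 1 + 2 + 4 + ... + 2^limit */
}
```

Under these assumptions a limit of 2 already gives 7 processes, and most of the spawn calls come in back-to-back pairs from the same parent.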
> I developed a small program to demonstrate my problem; it is in the
> attachment.
> -- start.c: launches the root of the tree (a ch_rec program) through
> MPI_Comm_spawn, passing the level value in argv. After the spawn, a
> message is sent to the child and the process blocks in an MPI_Recv.
> -- ch_rec.c: gets its level value and receives the parent's message; then,
> if its level is less than a predefined limit, it creates 2 children:
> - set the level value;
> - spawn 1 child;
> - send a message;
> - call MPI_Irecv;
> - repeat the 4 previous steps for the second child;
> - call MPI_Waitany to wait for the children's replies.
> When the children's messages are received, the process sends a message to
> its parent and calls MPI_Finalize.
>
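The first thing ch_rec does in the description above is read its level and decide whether to spawn. A minimal sketch of that step follows; the names parse_level and should_spawn are hypothetical, and the attachment may do this differently. The argv array given to MPI_Comm_spawn does not include the program name, so the child sees the first passed argument as argv[1].

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: read the tree level from argv[1].
 * Returns 0 on success, -1 if the argument is missing or is not a
 * non-negative integer that fits in an int. */
int parse_level(int argc, char **argv, int *level)
{
    assert(level != NULL);
    if (argc < 2 || argv[1][0] == '\0')
        return -1;

    char *end = NULL;
    errno = 0;
    long v = strtol(argv[1], &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > INT_MAX)
        return -1;

    *level = (int)v;
    return 0;
}

/* A node creates children only while its level is below the limit. */
int should_spawn(int level, int limit)
{
    return level < limit;
}
```

In ch_rec this would be called right after MPI_Init, before receiving the parent's message.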
> With openmpi-1.3.3 the program runs as expected, but with
> openmpi-1.4 I get the following error:
>
> $ mpirun -np 1 start
> level 0
> level = 1
> Parent sent: level 0 (pid:4279)
> level = 2
> Parent sent: level 1 (pid:4281)
> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0]
> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
>
> The error happens when my program tries to launch the second child
> immediately after the first spawn call.
> In my tests I put a sleep of 2 seconds between the first and the
> second spawn, and then the program runs as expected.
>
> Can someone help me with this version 1.4 bug?
>
> thanks,
> márcia.
>
> <spawn-problem.tar.gz>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users