Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] error performing MPI_Comm_spawn
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-15 09:58:10

Looks to me like it is a race condition, and the timing between 1.3.3 and 1.4 is just enough to trip it. I can break the race, but it will have to be in a future fix release.

Meantime, your best bet is to either stick with 1.3.3 or add the delay.

On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:

> Hi,
> I intend to develop an application that uses MPI_Comm_spawn to dynamically create new MPI tasks (or processes).
> The program is structured like a tree: each node creates 2 new ones until a predefined number of levels is reached.
> I developed a small program to illustrate my problem; it is in the attachment.
> -- start.c: launches the root of the tree (a ch_rec program) through MPI_Comm_spawn, passing the level value in argv. After the spawn, it sends a message to the child and blocks in an MPI_Recv.
> -- ch_rec.c: gets its level value and receives the parent's message; then, if its level is less than a predefined limit, it creates 2 children:
> - set the level value;
> - spawn 1 child;
> - send a message;
> - call MPI_Irecv;
> - repeat the 4 previous steps for the second child;
> - call MPI_Waitany to wait for the children's replies.
> When the children's messages are received, the process sends a message to its parent and calls MPI_Finalize.
> With openmpi-1.3.3 the program runs as expected, but with openmpi-1.4 I get the following error:
> $ mpirun -np 1 start
> level 0
> level = 1
> Parent sent: level 0 (pid:4279)
> level = 2
> Parent sent: level 1 (pid:4281)
> [] [[42824,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
> The error happens when my program tries to launch the second child immediately after the first spawn call.
> In my tests I put a sleep of 2 seconds between the first and the second spawn, and then the program runs as expected.
> Can someone help me with this version 1.4 bug?
> thanks,
> márcia.
> <spawn-problem.tar.gz>