Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] error performing MPI_Comm_spawn
From: Marcia Cristina Cera (marcia.cristina.cera_at_[hidden])
Date: 2009-12-16 07:43:10


Hi Ralph,

I am afraid I was a little hasty!
I reran my tests more carefully and got the same error with 1.3.3 as
well :-/
In that version, though, the error only shows up after some successful
runs, which is why I did not notice it before.
Furthermore, I increased the number of levels of the tree (which means more
concurrent dynamic process creations in the lower levels) and the program
never runs without error unless I add the delay.
So the problem may well be a race condition after all :(
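
To put numbers on how quickly the concurrent spawns grow with the number of levels, here is a small sketch. It assumes the root ch_rec process is level 0, that only processes whose level is below the limit spawn, and that each of them spawns exactly two children; the function names are made up for illustration and do not come from the attached program. The single spawn done by start is not counted.

```c
#include <assert.h>
#include <limits.h>

/* Number of processes at one level of the spawn tree: the root (level 0)
 * is a single process and every spawning process creates two children. */
unsigned long processes_at_level(unsigned level)
{
    assert(level < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << level;
}

/* Total processes in a tree whose deepest level is max_level:
 * 1 + 2 + 4 + ... + 2^max_level = 2^(max_level + 1) - 1. */
unsigned long total_processes(unsigned max_level)
{
    assert(max_level + 1 < sizeof(unsigned long) * CHAR_BIT);
    return (1UL << (max_level + 1)) - 1UL;
}

/* MPI_Comm_spawn calls issued by all processes at one level. Processes at
 * the deepest level (level == max_level) are leaves and do not spawn. */
unsigned long spawn_calls_at_level(unsigned level, unsigned max_level)
{
    return level < max_level ? 2UL * processes_at_level(level) : 0UL;
}
```

With a limit of 2 this gives 7 processes and 6 spawn calls inside the tree, 4 of them issued by the two processes at level 1. Each extra level doubles the number of spawn calls issued by the lowest spawning level.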

I tested with LAM/MPI 7.1.4 and at first glance it works fine. I worked
with LAM for years, but I migrated to Open MPI last year because LAM is
going to be discontinued...

I think I can continue developing my application with the delay in place
while I wait for a release... and leave the performance tests for
later :)
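
For reference, the delay workaround could be isolated roughly as follows, so that it is easy to remove once a fixed release is out. This is only a sketch under stated assumptions, not the code from the attachment: the callback type `spawn_fn`, the helper names, the `spawn_ctx` struct and the `USE_MPI` build flag are all hypothetical. The MPI part is guarded so that the delay logic can be compiled and tried without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>   /* sleep() */

/* Hypothetical callback: launches one child, returns 0 on success. */
typedef int (*spawn_fn)(int child_index, void *ctx);

/* Launch `count` children one after another, sleeping `delay_seconds`
 * between consecutive spawns (no sleep before the first or after the last).
 * Returns the number of children launched before the first failure. */
int spawn_children_with_delay(spawn_fn spawn, void *ctx, int count,
                              unsigned delay_seconds)
{
    int launched = 0;

    assert(spawn != NULL);
    for (int i = 0; i < count; i++) {
        if (i > 0 && delay_seconds > 0)
            sleep(delay_seconds);
        if (spawn(i, ctx) != 0)
            break;
        launched++;
    }
    return launched;
}

/* Stand-in callback for dry runs without MPI: only counts the calls. */
int count_spawn(int child_index, void *ctx)
{
    (void)child_index;
    (*(int *)ctx)++;
    return 0;
}

#ifdef USE_MPI   /* hypothetical build flag; compile with mpicc -DUSE_MPI */
#include <mpi.h>

struct spawn_ctx {
    char *level_arg;        /* level value handed to the child through argv */
    MPI_Comm children[2];   /* one intercommunicator per child */
};

/* Callback that spawns one ch_rec child (program name assumed). */
int spawn_ch_rec(int child_index, void *ctx)
{
    struct spawn_ctx *c = ctx;
    char *child_argv[] = { c->level_arg, NULL };

    assert(child_index >= 0 && child_index < 2);
    return MPI_Comm_spawn("ch_rec", child_argv, 1, MPI_INFO_NULL, 0,
                          MPI_COMM_SELF, &c->children[child_index],
                          MPI_ERRCODES_IGNORE) == MPI_SUCCESS ? 0 : 1;
}
#endif
```

In the real program the call would be something like `spawn_children_with_delay(spawn_ch_rec, &ctx, 2, 2)`, matching the 2 second sleep that made the test pass. Setting the last argument to 0 restores back-to-back spawns for the performance tests.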

Thank you again Ralph,
márcia.

On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Okay, I can replicate this.
>
> FWIW: your test program works fine with the OMPI trunk and 1.3.3. It only
> has a problem with 1.4. Since I can replicate it on multiple machines every
> single time, I don't think it is actually a race condition.
>
> I think someone made a change to the 1.4 branch that created a failure mode
> :-/
>
> Will have to get back to you on this - may take a while, and won't be in the
> 1.4.1 release.
>
> Thanks for the replicator!
>
> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
>
> Thank you, Ralph
>
> I will use the 1.3.3 for now...
> while waiting for a future fix release that breaks this race condition.
>
> márcia
>
> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Looks to me like it is a race condition, and the timing between 1.3.3 and
>> 1.4 is just enough to trip it. I can break the race, but it will have to be
>> in a future fix release.
>>
>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>>
>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>>
>> Hi,
>>
>> I intend to develop an application that uses MPI_Comm_spawn to
>> dynamically create new MPI tasks (or processes).
>> The program is structured like a tree: each node creates 2 new ones
>> until a predefined number of levels is reached.
>>
>> I developed a small program to illustrate my problem; it is in the
>> attachment.
>> -- start.c: launches the root of the tree (a ch_rec program) through
>> MPI_Comm_spawn, passing the level value in argv. After the spawn, a
>> message is sent to the child and the process blocks in an MPI_Recv.
>> -- ch_rec.c: gets its level value and receives the parent's message; then,
>> if its level is less than a predefined limit, it creates 2 children:
>> - set the level value;
>> - spawn 1 child;
>> - send a message;
>> - call an MPI_Irecv;
>> - repeat the 4 previous steps for the second child;
>> - call an MPI_Waitany to wait for the children's replies.
>> When the children's messages are received, the process sends a message to
>> its parent and calls MPI_Finalize.
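
The level hand-off through argv described above might look like the sketch below. It is an illustration only: the helper names are hypothetical, and the decimal-string encoding of the level is an assumption, since the attached sources are not shown here. What is certain is that the argv passed to MPI_Comm_spawn holds only the arguments (not the program name) and must end with NULL, so the child sees the first argument as argv[1] in its own main.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Parent side: write the child's level into buf as a decimal string and
 * fill the NULL-terminated argument vector to hand to MPI_Comm_spawn. */
void make_child_argv(int child_level, char *buf, size_t buflen,
                     char *child_argv[2])
{
    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "%d", child_level);
    child_argv[0] = buf;
    child_argv[1] = NULL;
}

/* Child side: read the level from the process's own argv.
 * Returns -1 if it is missing or is not a non-negative integer. */
int parse_level(int argc, char *argv[])
{
    char *end;
    long value;

    if (argc < 2)
        return -1;
    value = strtol(argv[1], &end, 10);
    if (end == argv[1] || *end != '\0' || value < 0 || value > INT_MAX)
        return -1;
    return (int)value;
}
```

A parent at level n would call `make_child_argv(n + 1, buf, sizeof buf, child_argv)` before each spawn, and the child would call `parse_level(argc, argv)` at startup and compare the result with the predefined limit.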
>>
>> Using the openmpi-1.3.3 version the program runs as expected but with
>> openmpi-1.4 I get the following error:
>>
>> $ mpirun -np 1 start
>> level 0
>> level = 1
>> Parent sent: level 0 (pid:4279)
>> level = 2
>> Parent sent: level 1 (pid:4281)
>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0]
>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
>>
>> The error happens when my program tries to launch the second child
>> immediately after the first spawn call.
>> In my tests I put a sleep of 2 seconds between the first and the
>> second spawn, and then the program runs as expected.
>>
>> Can someone help me with this version 1.4 bug?
>>
>> thanks,
>> márcia.
>>
>> <spawn-problem.tar.gz>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users