Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Segfault in 1.3 branch
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2008-07-15 02:10:29


> It looks like a new issue to me, Pasha. Possibly a side consequence of the
> IOF change made by Jeff and me the other day. From what I can see, it looks
> like your app was a simple "hello" - correct?
>
Yep, it is a simple hello application.
> If you look at the error, the problem occurs when mpirun is trying to route
> a message. Since the app is clearly running at this time, the problem is
> probably in the IOF. The error message shows that mpirun is attempting to
> route a message to a jobid that doesn't exist. We have a test in the RML
> that forces an "abort" if that occurs.
>
> I would guess that there is either a race condition or memory corruption
> occurring somewhere, but I have no idea where.
>
> This may be the "new hole in the dyke" I cautioned about in earlier notes
> regarding the IOF... :-)
>
> Still, given that this hits rarely, it probably is a more acceptable bug to
> leave in the code than the one we just fixed (duplicated stdin)...
>
It is not such a rare issue: 19 failures in my MTT run
(http://www.open-mpi.org/mtt/index.php?do_redir=765).

Pasha
> Ralph
>
>
>
> On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" <pasha_at_[hidden]> wrote:
>
>
>> Please see http://www.open-mpi.org/mtt/index.php?do_redir=764
>>
>> The error is not consistent. It takes a lot of iterations to reproduce it.
>> In my MTT testing I have seen it a few times.
>>
>> Is it a known issue?
>>
>> Regards,
>> Pasha
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
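
For readers following the thread, the RML check Ralph describes (abort when mpirun is asked to route a message to a jobid that does not exist) can be pictured roughly as below. This is only a sketch under assumptions: the names `route_message`, `known_jobids`, and `UnknownJobError` are hypothetical and are not taken from the Open MPI source, which is written in C and aborts the run rather than raising an exception.

```python
class UnknownJobError(RuntimeError):
    """Hypothetical error: a message targets a jobid the router does not know."""


def route_message(known_jobids, dest_jobid, payload):
    """Return (dest_jobid, payload) if the destination job is known, else raise.

    The raise stands in for the forced "abort" mentioned in the thread.
    """
    if dest_jobid not in known_jobids:
        # A race or memory corruption upstream could produce a bogus jobid;
        # the check fails loudly here instead of misrouting the message.
        raise UnknownJobError(f"cannot route to unknown jobid {dest_jobid}")
    return (dest_jobid, payload)
```

A check of this kind does not say where a bad jobid came from, only that one reached the router, which fits the observation above that the root cause could be a race or corruption somewhere else.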