Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MTT tests: segv's with sm on large messages
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-05 20:08:37


Jeff Squyres wrote:

> On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
>
>> You and Terry saw something that was occurring about 0.01% of the time
>> during MPI_Init during add_procs. That does not seem to be what we are
>> seeing here.
>
> Right -- that's what I'm saying. It's different than the MPI_INIT
> errors.

I was trying to say that there are two kinds of MPI_Init errors. One,
which you and Terry have seen, is in add_procs and shows up about 0.01%
of the time. The other, um, is not in add_procs and occurs more like 1%
of the time. I'm not real sure what "1%" means. It isn't always 1%. But
the times I've seen it have been in MTT runs in which there are dozens
of failures among thousands of runs.

>> But we have seen failures in 1.3.1 and 1.3.2 that look like the one
>> here. They occur more like 1% of the time and can occur during
>> MPI_Init
>> *OR* later during a collective call. What we're looking at here seems
>> to be related. E.g., see
>> http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
>
> Good to see that we're agreeing.
>
> Yes, I agree that this is not a new error, but it is worth fixing.
> Cisco's MTT didn't run last night because there was no new trunk
> tarball last night. I'll check Cisco's MTT tomorrow morning and see
> if there are any sm failures of this new flavor, and how frequently
> they're happening.

I just took a stroll down memory lane and these errors seem to be harder
to find than I thought. But, got some:
http://www.open-mpi.org/mtt/index.php?do_redir=1030 IU, v1.3.1

Ah, and http://www.open-mpi.org/mtt/index.php?do_redir=1031 IU_Sif,
v1.3 January 4/9700 failures

I'm not sure what to key in on to find these particular errors.

Yeah, worth fixing.