Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MTT tests: segv's with sm on large messages
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-05-06 00:02:41


If it would help in tracking this problem to give someone access to
Sif, I can probably make that happen. Just let me know.

Cheers,
Josh

On May 5, 2009, at 8:08 PM, Eugene Loh wrote:

> Jeff Squyres wrote:
>
>> On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
>>
>>> You and Terry saw something that was occurring about 0.01% of the
>>> time during MPI_Init during add_procs. That does not seem to be
>>> what we are seeing here.
>>
>> Right -- that's what I'm saying. It's different than the
>> MPI_INIT errors.
>
> I was trying to say that there are two kinds of MPI_Init errors.
> One, which you and Terry have seen, is in add_procs and shows up
> about 0.01% of the time. The other, um, is not and occurs more
> like 1% of the time. I'm not real sure what "1%" means. It isn't
> always 1%. But the times I've seen it have been in MTT runs in
> which there are dozens of failures among thousands of runs.
>
>>> But we have seen failures in 1.3.1 and 1.3.2 that look like the one
>>> here. They occur more like 1% of the time and can occur during
>>> MPI_Init *OR* later during a collective call. What we're looking at
>>> here seems to be related. E.g., see
>>> http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
>>
>> Good to see that we're agreeing.
>>
>> Yes, I agree that this is not a new error, but it is worth
>> fixing. Cisco's MTT didn't run last night because there was no
>> new trunk tarball last night. I'll check Cisco's MTT tomorrow
>> morning and see if there are any sm failures of this new flavor,
>> and how frequently they're happening.
>
> I just took a stroll down memory lane and these errors seem to be
> harder to find than I thought. But, got some:
> http://www.open-mpi.org/mtt/index.php?do_redir=1030 IU, v1.3.1
>
> Ah, and http://www.open-mpi.org/mtt/index.php?do_redir=1031
> IU_Sif, v1.3 January 4/9700 failures
>
> I'm not sure what to key in on to find these particular errors.
>
> Yeah, worth fixing.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel