If it would help in tracking this problem to give someone access to
Sif, I can probably make that happen. Just let me know.
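
In case it helps with reproducing: a loop over something like the
sketch below is roughly what I'd expect to tickle both failure modes
Eugene describes (MPI_Init/add_procs, and a later collective). The
process count, iteration count, and choice of collective are just
guesses on my part, not a known reproducer.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, in, out;
        /* the ~0.01% add_procs failures would show up inside MPI_Init */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        in = rank;
        /* the ~1% failures can also show up later, in a collective */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

Run repeatedly, e.g.:

    for i in `seq 1 1000`; do
      mpirun -np 4 ./repro || echo "failed on iteration $i"
    done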
On May 5, 2009, at 8:08 PM, Eugene Loh wrote:
> Jeff Squyres wrote:
>> On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
>>> You and Terry saw something that was occurring about 0.01% of the
>>> time, during MPI_Init in add_procs. That does not seem to be what
>>> we are seeing here.
>> Right -- that's what I'm saying. It's different than the
>> MPI_INIT errors.
> I was trying to say that there are two kinds of MPI_Init errors.
> One, which you and Terry have seen, is in add_procs and shows up
> about 0.01% of the time. The other is not in add_procs and occurs
> more like 1% of the time. I'm not really sure what "1%" means. It
> isn't always 1%, but the times I've seen it, it has been in MTT
> runs with dozens of failures among thousands of runs.
>>> But we have seen failures in 1.3.1 and 1.3.2 that look like the one
>>> here. They occur more like 1% of the time and can occur during
>>> MPI_Init *OR* later during a collective call. What we're looking
>>> at here seems to be related. E.g., see
>> Good to see that we're agreeing.
>> Yes, I agree that this is not a new error, but it is worth
>> fixing. Cisco's MTT didn't run last night because there was no
>> new trunk tarball. I'll check Cisco's MTT tomorrow
>> morning and see if there are any sm failures of this new flavor,
>> and how frequently they're happening.
> I just took a stroll down memory lane, and these errors seem to be
> harder to find than I thought. But I got some:
> http://www.open-mpi.org/mtt/index.php?do_redir=1030 (IU, v1.3.1)
> Ah, and http://www.open-mpi.org/mtt/index.php?do_redir=1031
> (IU_Sif, v1.3, January, 4/9700 failures)
> I'm not sure what to key in on to find these particular errors.
> Yeah, worth fixing.