Jeff Squyres wrote:
> On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
>
>> You and Terry saw something that was occurring about 0.01% of the time
>> during MPI_Init during add_procs. That does not seem to be what we are
>> seeing here.
>
> Right -- that's what I'm saying. It's different than the MPI_INIT
> errors.
I was trying to say that there are two kinds of MPI_Init errors. One,
which you and Terry have seen, is in add_procs and shows up about 0.01%
of the time. The other, um, is not in add_procs and occurs more like 1%
of the time. I'm not real sure what "1%" means. It isn't always 1%. But
the times I've seen it have been in MTT runs in which there are dozens
of failures among thousands of runs.
>> But we have seen failures in 1.3.1 and 1.3.2 that look like the one
>> here. They occur more like 1% of the time and can occur during
>> MPI_Init *OR* later during a collective call. What we're looking at
>> here seems to be related. E.g., see
>> http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
>
> Good to see that we're agreeing.
>
> Yes, I agree that this is not a new error, but it is worth fixing.
> Cisco's MTT didn't run last night because there was no new trunk
> tarball last night. I'll check Cisco's MTT tomorrow morning and see
> if there are any sm failures of this new flavor, and how frequently
> they're happening.
I just took a stroll down memory lane and these errors seem to be harder
to find than I thought. But I did find some:
http://www.open-mpi.org/mtt/index.php?do_redir=1030
IU, v1.3.1
Ah, and http://www.open-mpi.org/mtt/index.php?do_redir=1031
IU_Sif, v1.3, January, 4/9700 failures
I'm not sure what to key on to find these particular errors.
Yeah, worth fixing.