On May 5, 2009, at 6:01 PM, Eugene Loh wrote:
> You and Terry saw something that was occurring about 0.01% of the time
> during MPI_Init during add_procs. That does not seem to be what we are
> seeing here.
>
Right -- that's what I'm saying. It's different from the MPI_INIT errors.
> But we have seen failures in 1.3.1 and 1.3.2 that look like the one
> here. They occur more like 1% of the time and can occur during MPI_Init
> *OR* later during a collective call. What we're looking at here seems
> to be related. E.g., see
> http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
>
Good to see that we're agreeing.
Yes, I agree that this is not a new error, but it is worth fixing.
Cisco's MTT didn't run last night because there was no new trunk
tarball. I'll check Cisco's MTT tomorrow morning and see whether there
are any sm failures of this new flavor, and how frequently they're
happening.
--
Jeff Squyres
Cisco Systems