Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] OMPI 1.3 branch
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-05-14 12:47:58

Ralph Castain wrote:
> Hi folks
> I encourage people to please look at your MTT outputs. As we are
> preparing to roll the 1.3.3 release, I am seeing a lot of problems on
> the branch:
> 1. timeouts, coming in two forms: (a) MPI_Abort hanging, and (b)
> collectives hanging (this is mostly on Solaris)
Can you clarify, or send me a link showing, what makes you believe (b) is
mostly Solaris? Looking at Sun's MTT 1.3 nightly runs from last night, I see
47 timeouts on Linux and 24 timeouts on Solaris. That doesn't constitute
"mostly Solaris" to me. Also, how are you determining that these timeouts are
collective-based? I have a theory that they are, but I don't have a clear
smoking gun as of yet.

I've been looking at some collective hangs and segvs. These seem to
happen across different platforms and OSes (Linux and Solaris). I've been
finding them really hard to reproduce. I ran MPI_Allreduce_loc_c on
three clusters for 2 days without a hang or segv. I am really concerned
about whether we'll even be able to get this to fail with debugging on.

I have not been able to get a core file, or time with a hung run, in order
to get more information.
> 2. segfaults - mostly on sif, but occasionally elsewhere
> 3. daemon failed to report back - this was only on sif
> We will need to correct many of these for the release - unless it
> proves to be due to trivial errors, I don't see how we will be ready
> to roll release candidates next week.
> So let's please start taking a look at these?!
I've actually been looking at ours though I have not been extremely
vocal. I was hoping to get more info on our timeouts before requesting
> Ralph