Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OMPI 1.3 branch
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-05-14 13:09:41

On Thu, May 14, 2009 at 10:47 AM, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:

> Ralph Castain wrote:
>> Hi folks
>> I encourage people to please look at your MTT outputs. As we are preparing
>> to roll the 1.3.3 release, I am seeing a lot of problems on the branch:
>> 1. timeouts, coming in two forms: (a) MPI_Abort hanging, and (b)
>> collectives hanging (this is mostly on Solaris)
> Can you clarify or send me a link that makes you believe (b) is mostly
> Solaris? Looking at last night's Sun MTT 1.3 nightly runs, I see 47
> timeouts on Linux and 24 timeouts on Solaris. That doesn't constitute
> mostly Solaris to me. Also, how are you determining these timeouts are
> collective-based? I have a theory that they are, but I don't have a clear
> smoking gun as of yet.

I looked at this MTT report, which showed hangs in a whole bunch of
collective tests:

When I look at the hangs on other systems, they are in non-collective tests.
I'm not sure what that really means, though - it was just an observation
based on this one set of tests.

> I've been looking at some collective hangs and segv's. These seem to
> happen across different platforms and OSes (Linux and Solaris). I've been
> finding them really hard to reproduce. I ran MPI_Allreduce_loc_c on three
> clusters for 2 days without a hang or segv. I am really concerned about
> whether we'll even be able to get this to fail with debugging on.
> I have not been able to get a core, or time with a hung run, in order to
> get more information.
>> 2. segfaults - mostly on sif, but occasionally elsewhere
>> 3. daemon failed to report back - this was only on sif
>> We will need to correct many of these for the release - unless it proves
>> to be due to trivial errors, I don't see how we will be ready to roll
>> release candidates next week.
>> So let's please start taking a look at these?!
> I've actually been looking at ours, though I have not been extremely
> vocal. I was hoping to get more info on our timeouts before requesting
> help.

No problem - I wasn't pointing a finger at anyone in particular. Just wanted
to highlight that the branch is not in great shape since we had talked on
the telecon about trying to do a release next week.

> Ralph
>> ------------------------------------------------------------------------
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]