
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SM init failures
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-27 10:15:38


Josh Hursey wrote:

> Sif is also running the coll_hierarch component on some of those
> tests which has caused some additional problems. I don't know if that
> is related or not.

Indeed. Many of the MTT stack traces (for both 1.3.1 and 1.3.2) that have
seg faults and call out mca_btl_sm.so do involve collectives and/or have
mca_coll_hierarch.so in them. I could well imagine this is the culprit,
though I do not know for sure.

Ralph Castain wrote:

> Hmmm...Eugene, you need to be a tad less sensitive. Nobody was
> attempting to indict you or in any way attack you or your code.

Yes, I know, though thank you for saying so. I was overdoing the
defensive rhetoric trying to be funny, but I confess it's nervous
humor. There was stuff in the sm code that I couldn't convince myself was
100% robust. Nevertheless, I let that style remain in the code with my
changes... perhaps even pushing it a little bit. My putbacks include a
comment or two to that effect. E.g.,
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/sm/btl_sm.c?r=20774#523
. I understand why the occasional failures that Jeff and Terry saw did
not hold up 1.3.1, but I'd really like to understand them and fix them.
But at 0.01% fail rate (<0.001% for me... I've never seen it in 100Ks of
runs), all I can do about etiology and fixes is speculate.
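
For a rough sense of why failure rates this low are hard to pin down by rerunning tests, here is a minimal sketch of the arithmetic. It assumes runs are independent with a fixed failure probability, which may not hold for a real race condition, and it takes "100Ks of runs" to mean 300,000 purely for illustration. The function names are hypothetical.

```python
import math


def prob_zero_failures(fail_rate: float, runs: int) -> float:
    """Probability of observing no failures in `runs` independent runs,
    each failing with probability `fail_rate`."""
    return (1.0 - fail_rate) ** runs


def upper_bound_fail_rate(runs: int, confidence: float = 0.95) -> float:
    """Largest failure rate still consistent, at the given confidence,
    with having seen zero failures in `runs` independent runs.

    Solves (1 - p) ** runs = 1 - confidence for p.
    """
    # expm1 keeps precision when the result is very close to zero.
    return -math.expm1(math.log(1.0 - confidence) / runs)


if __name__ == "__main__":
    reported_rate = 1e-4   # the 0.01% rate mentioned in the thread
    clean_runs = 300_000   # ASSUMPTION: stand-in for "100Ks of runs"

    print("P(no failure in 100,000 runs at 0.01%):",
          prob_zero_failures(reported_rate, 100_000))
    print("95% upper bound on rate after", clean_runs, "clean runs:",
          upper_bound_fail_rate(clean_runs))
```

Under these assumptions, a clean streak of a few hundred thousand runs bounds the local failure rate at roughly 0.001%, while a true 0.01% rate would almost never produce 100,000 consecutive clean runs. That is consistent with the failures depending on something specific to the other systems rather than on luck.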

Okay, joke overdone and nervousness no longer funny. Indeed, annoying.
I stop.

> Since we clearly see problems on sif, and Josh has indicated a
> willingness to help with debugging, this might be a place to start
> the investigation. If asked nicely, they might even be willing to
> grant access to the machine, if that would help.

Maybe a starting point would be running IU_Sif without coll_hierarch and
seeing where we stand.
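
As a sketch of what that could look like for a single manual run: Open MPI's MCA selection syntax lets a component be excluded with a leading `^`. The process count and test binary name below are hypothetical, and the real IU_Sif runs are driven by MTT, so the equivalent change would presumably go into that configuration instead.

```shell
# Exclude the hierarch collective component; all other coll components
# remain eligible. The caret is quoted so the shell does not interpret it.
# ASSUMPTION: "-np 4" and "./my_mpi_test" are placeholders.
mpirun --mca coll '^hierarch' -np 4 ./my_mpi_test
```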

And, again, my gut feel is that the failures are unrelated to the 0.01%
failures that Jeff and Terry were seeing.