Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM init failures
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-31 19:10:43

Jeff Squyres wrote:

> On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:
>> The thing I was wondering about was memory barriers. E.g., you
>> initialize stuff and then post the FIFO pointer. The other guy sees
>> the FIFO pointer before the initialized memory.
> We do do memory barriers during that SM startup sequence. I haven't
> checked in a while, but I thought we were doing the right kinds of
> barriers in the right order...

There are certainly *some* barriers. The particular scenario I asked
about doesn't appear to be protected against (IMHO), but I certainly
don't understand these issues well, and I remain cautious about any
guesses I make until I can demonstrate both the problem and a solution.
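
To make the scenario concrete, here is a minimal sketch of the
"initialize, then post the pointer" pattern. It uses C11 atomics and
pthreads rather than the actual Open MPI sm BTL code or its atomic
wrappers, and every name in it (demo_fifo_t, posted_fifo, run_demo) is
hypothetical. The release store on the writer side and the acquire load
on the reader side are what should keep the reader from seeing the
pointer before the initialized fields.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared FIFO; not the Open MPI structure. */
typedef struct {
    int head;
    int tail;
    int ready_magic;
} demo_fifo_t;

static demo_fifo_t fifo_storage;

/* The "posted" pointer. NULL means "not published yet". */
static _Atomic(demo_fifo_t *) posted_fifo = NULL;

/* Writer: initialize the FIFO first, then publish its address. */
static void *writer(void *arg)
{
    (void)arg;
    fifo_storage.head = 0;
    fifo_storage.tail = 0;
    fifo_storage.ready_magic = 0x5eed;

    /* Release store: the writes above may not be reordered after it. */
    atomic_store_explicit(&posted_fifo, &fifo_storage,
                          memory_order_release);
    return NULL;
}

/* Reader: spin until the pointer appears, then read through it. */
static void *reader(void *arg)
{
    int *seen = arg;
    demo_fifo_t *f;

    /* Acquire load: pairs with the writer's release store. */
    while ((f = atomic_load_explicit(&posted_fifo,
                                     memory_order_acquire)) == NULL) {
        /* busy-wait until the writer posts the pointer */
    }
    *seen = f->ready_magic;
    return NULL;
}

/* Runs one writer and one reader; returns the value the reader saw. */
int run_demo(void)
{
    pthread_t w, r;
    int seen = 0;

    atomic_store(&posted_fifo, NULL);
    pthread_create(&r, NULL, reader, &seen);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

If both orderings were weakened to memory_order_relaxed, nothing in the
language would stop a weakly ordered machine from letting the reader
observe the pointer and still read stale field values, which is the
failure I was asking about. Whether the real startup sequence has such a
gap is still unverified.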

Regarding "demonstrating the problem", I see the Sun MTT logs show some
number of Init errors without mca_coll_hierarch involved. I'll try
rerunning on the same machines and see if I can trigger the problem.

> But George mentioned on the call today that they may have found the
> issue, but they're testing it. He didn't explain what the issue was
> in case he was wrong. ;-)