Jeff Squyres wrote:
> On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:
>> The thing I was wondering about was memory barriers. E.g., you
>> initialize stuff and then post the FIFO pointer. The other guy sees
>> FIFO pointer before the initialized memory.
> We do do memory barriers during that SM startup sequence. I haven't
> checked in a while, but I thought we were doing the right kinds of
> barriers in the right order...
There are certainly *some* barriers. The particular scenario I asked
about didn't seem to be protected against (IMHO), but I don't claim to
fully understand these issues, and I'll remain cautious about any
guesses I make until I can demonstrate both the problem and a solution.
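
To make the scenario concrete, here is a minimal sketch of the
"initialize, then post the pointer" pattern using C11 atomics. This is
NOT the Open MPI SM code: demo_fifo_t, demo_publish, and demo_try_read
are hypothetical names made up for illustration, and a real
shared-memory implementation would post an offset rather than a raw
pointer. The point is only that the store that posts the pointer needs
release semantics and the load that reads it needs acquire semantics
(or equivalent explicit barriers); otherwise the reader may see the
pointer before the initialized contents.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a FIFO; not Open MPI's actual structure. */
typedef struct {
    int head;
    int tail;
    int slots[8];
} demo_fifo_t;

static demo_fifo_t demo_storage;

/* The "posted" pointer that the other side polls. NULL = not posted. */
static _Atomic(demo_fifo_t *) demo_fifo_ptr = NULL;

/* Writer side: initialize the FIFO, then post the pointer.
 * The release store keeps the initializing writes above from being
 * reordered after the pointer becomes visible. */
void demo_publish(int first_value)
{
    demo_storage.head = 0;
    demo_storage.tail = 1;
    demo_storage.slots[0] = first_value;
    atomic_store_explicit(&demo_fifo_ptr, &demo_storage,
                          memory_order_release);
}

/* Reader side: returns 1 and stores the first element in *out if the
 * FIFO has been posted, 0 otherwise. The acquire load pairs with the
 * release store, so a non-NULL pointer implies the initialized
 * contents are visible too. */
int demo_try_read(int *out)
{
    assert(out != NULL);
    demo_fifo_t *f = atomic_load_explicit(&demo_fifo_ptr,
                                          memory_order_acquire);
    if (f == NULL) {
        return 0;
    }
    *out = f->slots[f->head];
    return 1;
}
```

If both accesses were plain (or memory_order_relaxed), nothing would
stop the compiler or a weakly ordered CPU from making the pointer
visible first, which is the case I was asking about.
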
Regarding "demonstrating the problem", I see the Sun MTT logs show some
number of Init errors without mca_coll_hierarch involved. I'll try
rerunning on the same machines and see if I can trigger the problem.
> But George mentioned on the call today that they may have found the
> issue, but they're testing it. He didn't explain what the issue was
> in case he was wrong. ;-)