If this is what I think it is, try using this MCA parameter:
-mca btl_openib_ib_timeout 20
If this fixes it, then the problem is in the IB fabric itself, although
I don't fully understand what's going on. Someone else might be able to
explain in more detail.
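
For what it's worth, the value is an exponent rather than a time in
seconds. As I understand it, the local ACK timeout the HCA actually uses
is 4.096 microseconds * 2^btl_openib_ib_timeout. Here is a rough Python
sketch of that arithmetic; the function name is my own and not anything
in Open MPI, and the 1..31 range check is my assumption that the value
is a 5-bit field with 0 reserved:

```python
# Sketch: effective InfiniBand local ACK timeout for a given
# btl_openib_ib_timeout value, using timeout = 4.096 us * 2^value.

BASE_US = 4.096  # base unit of the local ACK timeout, in microseconds


def ack_timeout_seconds(exponent: int) -> float:
    """Return the local ACK timeout in seconds for the given exponent.

    Assumption: the exponent is a 5-bit field and 0 has a special
    meaning, so only 1..31 is accepted here.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in 1..31")
    return BASE_US * (2 ** exponent) / 1e6


if __name__ == "__main__":
    # Print a few values to show how quickly the timeout grows.
    for value in (10, 14, 18, 20):
        seconds = ack_timeout_seconds(value)
        print(f"btl_openib_ib_timeout={value:2d} -> {seconds:.6f} s")
```

So a value of 20 works out to roughly 4.3 seconds per attempt, compared
with about 4 milliseconds at a value of 10.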
Brian Dobbins wrote:
> Hi Brock
>> We have a user whose code keeps failing at a similar point in the
>> code. The errors (below) would make me think it's a fabric problem,
>> but ibcheckerrors is not returning any issues. He is using
>> openmpi-1.2.0 with OFED on RHEL4.
> Strangely enough, I hit this exact problem about half an hour ago...
> what compilers is he using for the code / OpenMPI? I haven't narrowed
> down the cause yet because the system I'm on is a tad, uh, disheveled,
> but it'd be good to find any commonality. I'm using PGI-7.1-2
> (pgf77/pgf90) with OpenMPI-1.2.4. The system also happens to be RHEL 4
> (Update 3).
> Also, the code I'm running is CCSM, and it gave an error message
> about being unable to read a file correctly right before my
> synchronization. This code has worked on other systems in the past
> (non-IB, non-IBRIX), but something as basic as a file write shouldn't be
> adversely affected by such things, hence I'm going to try backing the
> compiler down to a 'known-good' one first, since perhaps that's my
> problem. I don't suppose you saw any messages of that sort? I did
> already try setting the retry count parameter up to 20 (from 7), but
> that didn't fix it.
> - Brian
> Brian Dobbins
> Yale University HPC