Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB problems
From: Brock Palen (brockp_at_[hidden])
Date: 2007-11-27 10:25:36


On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:

> If this is what I think it is, try using this MCA parameter:
>
> -mca btl_openib_ib_timeout 20

The user used this option and it allowed the run to complete.
You say it's an issue with the fabric, but ibcheckerrors does not show any
problems.

It's all Topspin (Cisco) gear: NICs, switch, and cables.
Should I follow up with Cisco more?

Brock
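
For context on the parameter suggested above: per the Open MPI FAQ, the
InfiniBand local ACK timeout is computed as 4.096 microseconds *
2^btl_openib_ib_timeout. Below is a minimal Python sketch of that arithmetic.
The function name and the range check are illustrative assumptions, not part
of Open MPI, and the value 10 is included only for comparison.

```python
# Sketch: convert an InfiniBand local ACK timeout exponent into seconds.
# Formula: timeout = 4.096 microseconds * 2**exponent.
# ack_timeout_seconds is a hypothetical helper, not an Open MPI API.

BASE_SECONDS = 4.096e-6  # 4.096 microseconds


def ack_timeout_seconds(exponent: int) -> float:
    """Return the ACK timeout in seconds for a given exponent value."""
    # Assumption: the value is a 5-bit field; 0 is excluded here because
    # it does not follow the 2**n formula.
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be between 1 and 31")
    return BASE_SECONDS * (2 ** exponent)


if __name__ == "__main__":
    # 20 is the value suggested in this thread; 10 is a comparison point.
    for exp in (10, 20):
        print(f"exponent {exp}: {ack_timeout_seconds(exp):.6f} s")
```

With a value of 20 this works out to roughly 4.3 seconds per attempt, versus
about 4.2 milliseconds at 10, which may explain why a fabric that is slow to
acknowledge packets stops triggering errors once the value is raised.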

>
> If this fixes it -- I don't fully understand what's going on, but it's
> an issue in the IB fabric itself. Someone else might be able to
> explain in more detail.
>
> Andrew
>
>
> Brian Dobbins wrote:
>> Hi Brock
>>> We have a user whose code keeps failing at a similar point in the
>>> code. The errors (below) would make me think it's a fabric problem,
>>> but ibcheckerrors is not returning any issues. He is using
>>> openmpi-1.2.0 with OFED on RHEL4,
>>>
>> Strangely enough, I hit this exact problem about half an hour ago...
>> What compilers is he using for the code / OpenMPI? I haven't narrowed
>> down the cause yet because the system I'm on is a tad, uh, disheveled,
>> but it'd be good to find any commonality. I'm using PGI-7.1-2
>> (pgf77/pgf90) with OpenMPI-1.2.4. The system also happens to be RHEL 4
>> (Update 3).
>>
>> Also, the code I'm running is CCSM, and it gave an error message
>> about being unable to read a file correctly right before my
>> synchronization. This code has worked on other systems in the past
>> (non-IB, non-IBRIX), but something as basic as a file write shouldn't be
>> adversely affected by such things, hence I'm going to try backing the
>> compiler down to a 'known-good' one first, since perhaps that's my
>> problem. I don't suppose you saw any messages of that sort? I did
>> already try setting the retry count parameter up to 20 (from 7), but
>> that didn't fix it.
>>
>> Cheers,
>> - Brian
>>
>>
>> Brian Dobbins
>> Yale University HPC
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users