Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB problems
From: Andrew Friedley (afriedle_at_[hidden])
Date: 2007-11-27 10:49:46


Brock Palen wrote:
> On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:
>
>> If this is what I think it is, try using this MCA parameter:
>>
>> -mca btl_openib_ib_timeout 20
>
> The user used this option and it allowed the run to complete.
> You say it's an issue with the fabric, but ibshowerrors does not show
> any problems.
>
> It's Topspin (Cisco) gear: NICs, switch, cables.
> Should I follow up with Cisco more?

Sure, why not, if you think it'd be useful. FWIW, I see this on
Voltaire/Mellanox hardware with Open MPI; others here at LLNL tell me
they've seen it with MVAPICH as well.
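
For reference, here is a rough sketch of what raising that parameter does, assuming btl_openib_ib_timeout is passed through as the InfiniBand local ACK timeout, which is encoded as an exponent: the wait is 4.096 microseconds times 2^value. The function name below is hypothetical, and the value 10 is only a smaller number for comparison, not a claim about any default.

```python
# Sketch: convert an InfiniBand local ACK timeout exponent to seconds.
# Assumption: the MCA parameter btl_openib_ib_timeout is used as this exponent.
# ib_ack_timeout_seconds is a hypothetical helper, not part of Open MPI.

BASE_MICROSECONDS = 4.096  # InfiniBand base unit for the ACK timeout


def ib_ack_timeout_seconds(value: int) -> float:
    """Return 4.096 us * 2**value, expressed in seconds.

    The field is 5 bits wide; 0 has a special meaning and is not handled here.
    """
    if not 1 <= value <= 31:
        raise ValueError("timeout exponent must be in the range 1..31")
    return BASE_MICROSECONDS * (2 ** value) / 1e6


if __name__ == "__main__":
    for value in (10, 20):
        print(f"timeout={value}: {ib_ack_timeout_seconds(value):.6f} s")
```

With a value of 20 the per-attempt wait works out to roughly 4.3 seconds, against about 4 milliseconds for a value of 10, so a larger value gives a slow or congested fabric far more time to acknowledge a packet before a retry is triggered.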

Andrew

> Brock
>
>> If this fixes it -- I don't fully understand what's going on, but it's
>> an issue in the IB fabric itself. Someone else might be able to
>> explain in more detail.
>>
>> Andrew
>>
>>
>> Brian Dobbins wrote:
>>> Hi Brock
>>>> We have a user whose code keeps failing at a similar point in the
>>>> code. The errors (below) would make me think it's a fabric problem,
>>>> but ibcheckerrors is not returning any issues. He is using
>>>> openmpi-1.2.0 with OFED on RHEL4.
>>>>
>>> Strangely enough, I hit this exact problem about half an hour ago...
>>> what compilers is he using for the code / OpenMPI? I haven't narrowed
>>> down the cause yet because the system I'm on is a tad, uh, disheveled,
>>> but it'd be good to find any commonality. I'm using PGI-7.1-2
>>> (pgf77/pgf90) with OpenMPI-1.2.4. The system also happens to be RHEL 4
>>> (Update 3).
>>>
>>> Also, the code I'm running is CCSM, and it gave an error message
>>> about being unable to read a file correctly right before my
>>> synchronization. This code has worked on other systems in the past
>>> (non-IB, non-IBRIX), but something as basic as a file write shouldn't
>>> be adversely affected by such things, hence I'm going to try backing
>>> the compiler down to a 'known-good' one first, since perhaps that's my
>>> problem. I don't suppose you saw any messages of that sort? I did
>>> already try setting the retry count parameter up to 20 (from 7), but
>>> that didn't fix it.
>>>
>>> Cheers,
>>> - Brian
>>>
>>>
>>> Brian Dobbins
>>> Yale University HPC
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users