Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB problems
From: Brock Palen (brockp_at_[hidden])
Date: 2007-11-27 15:11:19


On Nov 27, 2007, at 10:49 AM, Andrew Friedley wrote:

>
>
> Brock Palen wrote:
>> On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:
>>
>>> If this is what I think it is, try using this MCA parameter:
>>>
>>> -mca btl_openib_ib_timeout 20
>>
>> The user used this option and it allowed the run to complete.
>> You say it's an issue with the fabric, but ibshowerrors does not show
>> any problems.
>>
>> It's Topspin (Cisco) gear: NICs, switch, cables.
>> Should I follow up with Cisco more?
>
> Sure, why not, if you think it'd be useful. FWIW, I see this on
> Voltaire/Mellanox hardware with Open MPI; others here at LLNL tell me
> they've seen it with MVAPICH as well.

Where would be a good place to look? Should this just be the default for
OMPI, then? ompi_info shows the default as 10 seconds. Is 'seconds'
really the right unit?
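
On the units question, here is a rough sketch of the arithmetic. It assumes (this is not confirmed anywhere in this thread) that btl_openib_ib_timeout is passed through as the InfiniBand local ACK timeout, which is an exponent rather than a count of seconds: the wait is 4.096 microseconds * 2^value. The function name is made up for illustration.

```python
# Assumption: the MCA value is the InfiniBand "local ACK timeout" exponent,
# so the real wait is 4.096 microseconds * 2**value, not `value` seconds.
BASE_MICROSECONDS = 4.096


def ib_timeout_seconds(exponent: int) -> float:
    """Convert a timeout exponent to seconds under the assumption above."""
    return BASE_MICROSECONDS * (2 ** exponent) / 1_000_000


# Compare the default reported by ompi_info (10) with the suggested value (20).
for value in (10, 20):
    print(f"btl_openib_ib_timeout={value}: about {ib_timeout_seconds(value):.4g} s")
```

If that assumption holds, 10 works out to a few milliseconds and 20 to a few seconds, which would fit with the larger value letting the run complete.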

>
> Andrew
>
>> Brock
>>
>>> If this fixes it -- I don't fully understand what's going on, but it's
>>> an issue in the IB fabric itself. Someone else might be able to
>>> explain in more detail.
>>>
>>> Andrew
>>>
>>>
>>> Brian Dobbins wrote:
>>>> Hi Brock
>>>>> We have a user whose code keeps failing at a similar point in the
>>>>> code. The errors (below) would make me think it's a fabric problem,
>>>>> but ibcheckerrors is not returning any issues. He is using
>>>>> openmpi-1.2.0 with OFED on RHEL4,
>>>>>
>>>> Strangely enough, I hit this exact problem about half an hour ago...
>>>> What compilers is he using for the code / OpenMPI? I haven't narrowed
>>>> down the cause yet because the system I'm on is a tad, uh, disheveled,
>>>> but it'd be good to find any commonality. I'm using PGI-7.1-2
>>>> (pgf77/pgf90) with OpenMPI-1.2.4. The system also happens to be RHEL 4
>>>> (Update 3).
>>>>
>>>> Also, the code I'm running is CCSM, and it gave an error message
>>>> about being unable to read a file correctly right before my
>>>> synchronization. This code has worked on other systems in the past
>>>> (non-IB, non-IBRIX), but something as basic as a file write shouldn't
>>>> be adversely affected by such things, hence I'm going to try backing
>>>> the compiler down to a 'known-good' one first, since perhaps that's my
>>>> problem. I don't suppose you saw any messages of that sort? I did
>>>> already try raising the retry count parameter to 20 (from 7), but
>>>> that didn't fix it.
>>>>
>>>> Cheers,
>>>> - Brian
>>>>
>>>>
>>>> Brian Dobbins
>>>> Yale University HPC
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users