Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB problems
From: Andrew Friedley (afriedle_at_[hidden])
Date: 2007-11-27 10:49:46

Brock Palen wrote:
> On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:
>> If this is what I think it is, try using this MCA parameter:
>> -mca btl_openib_ib_timeout 20
> The user used this option and it allowed the run to complete.
> You say it's an issue with the fabric, but ibshowerrors does not show any
> problems.
> It's Topspin (Cisco) gear: NICs, switch, cables.
> Should I follow up with Cisco more?

Sure, why not, if you think it'd be useful. FWIW, I see this on
Voltaire/Mellanox hardware with Open MPI; others here at LLNL tell me
they've seen it with MVAPICH as well.
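
For reference, a rough sketch of what the btl_openib_ib_timeout value quoted
above means in wall-clock terms. It assumes the usual InfiniBand local ACK
timeout rule, where the value is an exponent and the timeout is
4.096 microseconds * 2^value. The function name and the comparison value of 10
are illustrative assumptions, not taken from this thread.

```python
# Sketch: convert an InfiniBand local ACK timeout exponent into seconds.
# Assumes timeout = 4.096 us * 2**exponent (the exponent is a 5-bit field).
# ib_ack_timeout_seconds is a hypothetical helper name, not an Open MPI API.

BASE_TIMEOUT_SECONDS = 4.096e-6  # 4.096 microseconds


def ib_ack_timeout_seconds(timeout_exponent: int) -> float:
    """Return the ACK timeout in seconds for an exponent in 1..31."""
    # 0 is a special case on the hardware side and is not modelled here.
    if not 1 <= timeout_exponent <= 31:
        raise ValueError("timeout exponent must be in the range 1..31")
    return BASE_TIMEOUT_SECONDS * (2 ** timeout_exponent)


if __name__ == "__main__":
    # 20 is the value suggested in this thread; 10 is only a comparison point.
    for exponent in (10, 20):
        seconds = ib_ack_timeout_seconds(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {seconds:.6f} s per attempt")
```

Under that assumption, a value of 20 gives the fabric roughly 4.3 seconds per
attempt before a retry is triggered, which may be why the longer setting lets
the run complete.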


> Brock
>> If this fixes it -- I don't fully understand what's going on, but it's
>> an issue in the IB fabric itself. Someone else might be able to
>> explain in more detail.
>> Andrew
>> Brian Dobbins wrote:
>>> Hi Brock
>>>> We have a user whose code keeps failing at a similar point in the
>>>> code. The errors (below) would make me think it's a fabric problem,
>>>> but ibcheckerrors is not returning any issues. He is using
>>>> openmpi-1.2.0 with OFED on RHEL4,
>>> Strangely enough, I hit this exact problem about half an hour ago...
>>> What compilers is he using for the code / OpenMPI? I haven't narrowed
>>> down the cause yet because the system I'm on is a tad, uh, disheveled,
>>> but it'd be good to find any commonality. I'm using PGI-7.1-2
>>> (pgf77/pgf90) with OpenMPI-1.2.4. The system also happens to be RHEL 4
>>> (Update 3).
>>> Also, the code I'm running is CCSM, and it gave an error message
>>> about being unable to read a file correctly right before my
>>> synchronization. This code has worked on other systems in the past
>>> (non-IB, non-IBRIX), but something as basic as a file write shouldn't be
>>> adversely affected by such things, hence I'm going to try backing the
>>> compiler down to a 'known-good' one first, since perhaps that's my
>>> problem. I don't suppose you saw any messages of that sort? I did
>>> already try setting the retry count parameter up to 20 (from 7), but
>>> that didn't fix it.
>>> Cheers,
>>> - Brian
>>> Brian Dobbins
>>> Yale University HPC
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]