Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] NP64 _gather_ problem
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-17 12:02:22


  Does anyone have an NP64 IB cluster handy? I'd be interested to know whether
IB behaves this way when running with the rdmacm connect method, i.e. with:

  --mca btl_openib_cpc_include rdmacm --mca btl openib,sm,self

Steve.

On 9/17/2010 10:41 AM, Steve Wise wrote:
> Yes, it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So
> it's not the algorithm in and of itself, but rather some interplay
> between the algorithm and connection setup, I guess.
>
>
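
For reference, roughly the effect that mpi_preconnect_mpi is after can be
emulated at the application level: have every rank exchange a tiny message
with every other rank before the timed loop, so that all point-to-point
connections already exist when the collective starts. A minimal sketch
(illustrative only; the function name and structure are mine, not Open MPI's
actual wire-up code):

  #include <mpi.h>

  /* Touch every peer once so the BTL brings up all connections now,
   * instead of lazily on first use inside the gather. */
  static void preconnect_all(MPI_Comm comm)
  {
      int rank, size;
      char tx = 0, rx = 0;

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      for (int dist = 1; dist < size; dist++) {
          int to   = (rank + dist) % size;
          int from = (rank - dist + size) % size;
          /* Sendrecv keeps this deadlock-free: every send is matched by
           * a receive posted in the same call at the peer. */
          MPI_Sendrecv(&tx, 1, MPI_CHAR, to,   0,
                       &rx, 1, MPI_CHAR, from, 0,
                       comm, MPI_STATUS_IGNORE);
      }
  }
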
> On 9/17/2010 5:24 AM, Terry Dontje wrote:
>> Does setting the mca parameter mpi_preconnect_mpi to 1 help at all? This
>> might help determine whether it is the actual connection setup
>> between processes that is out of sync, as opposed to something in
>> the gather algorithm itself.
>>
>> --td
>>
>> Steve Wise wrote:
>>> Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its
>>> algorithm for job sizes > 60 to some binomial method. I changed the
>>> threshold to 100 and my NP64 jobs run fine. Now to try and
>>> understand what about ompi_coll_tuned_gather_intra_binomial() is
>>> causing these connect delays...
>>>
>>>
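
For context, the general shape of a binomial-tree gather is sketched below
(the technique in general, not ompi_coll_tuned_gather_intra_binomial()
itself): each rank collects the blocks of its subtree and forwards them to a
parent whose rank differs by a single power-of-two bit, so every rank talks
to only about log2(NP) peers, and each of those connections is brought up
only when the first message of the collective crosses that edge of the tree.

  #include <mpi.h>
  #include <stdlib.h>
  #include <string.h>

  /* Sketch: binomial-tree gather of 'blocklen' bytes per rank to rank 0.
   * Illustrative only -- not the coll/tuned implementation. */
  static void binomial_gather(const void *sendbuf, int blocklen,
                              void *recvbuf, MPI_Comm comm)
  {
      int rank, size;

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      /* tmp holds this rank's subtree, block i belonging to rank (rank + i);
       * for rank 0 it is simply the final receive buffer. */
      char *tmp = (rank == 0) ? (char *)recvbuf
                              : malloc((size_t)size * blocklen);
      memcpy(tmp, sendbuf, (size_t)blocklen);
      int nblocks = 1;

      for (int mask = 1; mask < size; mask <<= 1) {
          if (rank & mask) {
              /* Forward everything collected so far to the parent; each
               * rank sends to exactly one peer in the whole gather. */
              MPI_Send(tmp, nblocks * blocklen, MPI_BYTE,
                       rank ^ mask, 0, comm);
              break;
          }
          int child = rank | mask;
          if (child < size) {
              /* The child has accumulated at most 'mask' consecutive blocks. */
              int cnt = (size - child < mask) ? size - child : mask;
              MPI_Recv(tmp + (size_t)(child - rank) * blocklen,
                       cnt * blocklen, MPI_BYTE, child, 0, comm,
                       MPI_STATUS_IGNORE);
              nblocks += cnt;
          }
      }

      if (rank != 0)
          free(tmp);
  }

Because a parent cannot forward to its own parent until it has heard from all
of its children, connection setups near the root are chained behind those
lower in the tree rather than happening in parallel, which may be part of the
interplay with connection setup described above.
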
>>> On 9/16/2010 1:01 PM, Steve Wise wrote:
>>>> Oops. One key typo here: This is the IMB-MPI1 gather test, not
>>>> barrier. :(
>>>>
>>>>
>>>> On 9/16/2010 12:05 PM, Steve Wise wrote:
>>>>> Hi,
>>>>>
>>>>> I'm debugging a performance problem running IMB-MPI1/barrier
>>>>> on an NP64 cluster (8 nodes, 8 cores each). I'm using
>>>>> openmpi-1.4.1 from the OFED-1.5.1 distribution. The BTL is
>>>>> openib/iWARP via Chelsio's T3 RNIC. In short, NP60 and smaller
>>>>> runs complete in a timely manner as expected, but NP61 and larger
>>>>> runs come to a crawl at the 8KB IO size and take ~5-10 minutes to
>>>>> complete. They do complete, though.
>>>>>
>>>>> It behaves this way even if I run on more than 8 nodes so there are
>>>>> available cores; i.e., an NP64 run on a 16-node cluster still behaves
>>>>> the same way even though there are only 4 ranks on each node. So it's
>>>>> apparently not a thread starvation issue due to a lack of cores.
>>>>>
>>>>> When in the stalled state, I see on the order of 100 or so established
>>>>> iWARP connections on each node, and the connection count increases VERY
>>>>> slowly and sporadically (at its peak there are around 800 connections
>>>>> for an NP64 gather operation). In comparison, in the <= NP60 runs the
>>>>> connections quickly ramp up to the expected count.
>>>>>
>>>>> I added hooks in the openib BTL to track the time it takes to set up
>>>>> each connection. In all runs, both <= NP60 and > NP60, the average
>>>>> connection setup time is around 200ms, and the max setup time seen is
>>>>> never much above that value. That tells me that it's not individual
>>>>> connection setup that is the issue.
>>>>>
>>>>> I then added printfs/fflushes in librdmacm to see visually when a
>>>>> connection is attempted and when it is accepted. With these printfs, I
>>>>> see the connections get set up quickly and evenly in the <= NP60 case:
>>>>> initially when the job starts, there is a small flurry of connections
>>>>> being set up, then the run begins, and at around the 1KB IO size there
>>>>> is a second, large flurry of connection setups. Then the test continues
>>>>> and completes. In the > NP60 case, this second round of connection
>>>>> setups is very sporadic and slow. Very slow! I see little bursts of
>>>>> ~10-20 connection setups, then long, random pauses. The net is that
>>>>> full connection setup for the job takes 5-10 minutes, and during this
>>>>> time the ranks are basically spinning idle, waiting for the connections
>>>>> to be set up. So I'm concluding that something above the BTL layer
>>>>> isn't issuing the endpoint connect requests in a timely manner.
>>>>>
>>>>> Attached are 3 padb dumps during the stall. Anybody see anything
>>>>> interesting in these?
>>>>>
>>>>> Any ideas how I can further debug this? Once I get above the
>>>>> openib BTL layer my eyes glaze over and I get lost quickly. :) I
>>>>> would greatly appreciate any ideas from the Open MPI experts!
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Steve.
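
For anyone who wants to take IMB out of the picture, a stripped-down
reproducer in the same spirit (message sizes, iteration counts, and output
format here are arbitrary) could look something like this:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Time MPI_Gather to rank 0 at increasing message sizes, loosely
   * mimicking the IMB-MPI1 gather test described above. */
  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int max_bytes = 1 << 16;                    /* up to 64KB per rank */
      const int iters = 100;
      char *sendbuf = malloc(max_bytes);                /* contents don't matter */
      char *recvbuf = malloc((size_t)max_bytes * size); /* only used at rank 0 */

      for (int bytes = 1; bytes <= max_bytes; bytes <<= 1) {
          MPI_Barrier(MPI_COMM_WORLD);
          double t0 = MPI_Wtime();
          for (int i = 0; i < iters; i++)
              MPI_Gather(sendbuf, bytes, MPI_BYTE,
                         recvbuf, bytes, MPI_BYTE, 0, MPI_COMM_WORLD);
          double t1 = MPI_Wtime();
          if (rank == 0)
              printf("%7d bytes: %10.3f ms/iter\n",
                     bytes, (t1 - t0) * 1000.0 / iters);
      }

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }

If the time really is going into connection establishment rather than the
gather itself, the first pass through the 8KB size should absorb almost all
of the 5-10 minutes, and later iterations at that size should run at normal
speed.
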
>>
>>
>> --
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden]


