
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] NP64 _gather_ problem
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-17 12:02:22

  Does anyone have an NP64 IB cluster handy? I'd be interested to know
whether IB behaves this way when running with the rdmacm connect method,
i.e., with:

  --mca btl_openib_cpc_include rdmacm --mca btl openib,sm,self
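
A complete invocation along those lines would look something like the
sketch below; the process count, hostfile, and IMB-MPI1 path are
placeholders for whatever your cluster uses:

  # hypothetical NP64 gather run forcing the rdmacm connection manager
  mpirun -np 64 --hostfile ./hosts \
      --mca btl_openib_cpc_include rdmacm \
      --mca btl openib,sm,self \
      ./IMB-MPI1 gather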


On 9/17/2010 10:41 AM, Steve Wise wrote:
> Yes it does. With mpi_preconnect_mpi set to 1, the NP64 run doesn't
> stall. So it's not the algorithm in and of itself, but rather some
> interplay between the algorithm and connection setup, I guess.
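
For reference, a minimal sketch of a run with preconnect forced on
(process count, hostfile, and benchmark path are placeholders):

  # mpi_preconnect_mpi=1 establishes all connections during MPI_Init
  # instead of lazily on first communication
  mpirun -np 64 --hostfile ./hosts \
      --mca mpi_preconnect_mpi 1 \
      --mca btl openib,sm,self \
      ./IMB-MPI1 gather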
> On 9/17/2010 5:24 AM, Terry Dontje wrote:
>> Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all?
>> That might help determine whether it is actually the connection setup
>> between processes that is out of sync, as opposed to something in the
>> gather algorithm itself.
>> --td
>> Steve Wise wrote:
>>> Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() switches to a
>>> binomial algorithm for job sizes > 60. I changed that threshold to
>>> 100 and my NP64 jobs run fine. Now to try to understand what about
>>> ompi_coll_tuned_gather_intra_binomial() is causing these connect
>>> delays...
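
As a related experiment, the same comparison can be made without
recompiling by overriding the tuned component's decision with MCA
parameters; a sketch, assuming the coll_tuned dynamic-rules parameters
in this release (the algorithm numbering can be checked with
ompi_info --param coll tuned):

  # bypass the size-based decision in ompi_coll_tuned_gather_intra_dec_fixed()
  # and force one gather algorithm for all sizes (1 selects basic linear here)
  mpirun -np 64 \
      --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_gather_algorithm 1 \
      --mca btl openib,sm,self \
      ./IMB-MPI1 gather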
>>> On 9/16/2010 1:01 PM, Steve Wise wrote:
>>>> Oops. One key typo here: This is the IMB-MPI1 gather test, not
>>>> barrier. :(
>>>> On 9/16/2010 12:05 PM, Steve Wise wrote:
>>>>> Hi,
>>>>> I'm debugging a performance problem running IMB-MPI1/barrier on an
>>>>> NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from
>>>>> the OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's
>>>>> T3 RNIC. In short, NP60 and smaller runs complete in a timely
>>>>> manner as expected, but NP61 and larger runs slow to a crawl at the
>>>>> 8KB IO size and take ~5-10 minutes to complete. They do complete,
>>>>> though.
>>>>>
>>>>> It behaves this way even if I run on more than 8 nodes so that
>>>>> there are spare cores: an NP64 run on a 16-node cluster behaves the
>>>>> same way even though there are only 4 ranks on each node. So it's
>>>>> apparently not a thread starvation issue due to lack of cores.
>>>>> When in the stalled state, I see on the order of 100 or so
>>>>> established iWARP connections on each node, and the connection
>>>>> count increases VERY slowly and sporadically (at its peak there are
>>>>> around 800 connections for an NP64 gather operation). In
>>>>> comparison, in the <= NP60 runs the connections quickly ramp up to
>>>>> the expected count.
>>>>>
>>>>> I added hooks in the openib BTL to track the time it takes to set
>>>>> up each connection. In all runs, both <= NP60 and > NP60, the
>>>>> average connection setup time is around 200ms, and the maximum
>>>>> setup time seen is never much above that value. That tells me that
>>>>> individual connection setup is not the issue. I then added
>>>>> printfs/fflushes in librdmacm to see when a connection is attempted
>>>>> and when it is accepted. With these printfs, I see the connections
>>>>> get set up quickly and evenly in the <= NP60 case: when the job is
>>>>> started there is a small flurry of connection setups, then the run
>>>>> begins, and at around the 1KB IO size I see a second large flurry
>>>>> of connection setups. Then the test continues and completes. In
>>>>> the > NP60 case, this second round of connection setups is very
>>>>> sporadic and slow. Very slow! I'll see little bursts of ~10-20
>>>>> connections set up, then long random pauses. The net result is
>>>>> that full connection setup for the job takes 5-10 minutes, and
>>>>> during this time the ranks are basically spinning idle waiting for
>>>>> the connections to be set up. So I'm concluding that something
>>>>> above the BTL layer isn't issuing the endpoint connect requests in
>>>>> a timely manner.
>>>>>
>>>>> Attached are 3 padb dumps taken during the stall. Does anybody see
>>>>> anything interesting in them?
>>>>>
>>>>> Any ideas how I can further debug this? Once I get above the
>>>>> openib BTL layer my eyes glaze over and I get lost quickly. :) I
>>>>> would greatly appreciate any ideas from the Open MPI experts!
>>>>>
>>>>> Thanks in advance,
>>>>> Steve.
>> --
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden]