
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] NP64 _gather_ problem
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-16 17:05:16

  Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() switches its
algorithm to a binomial method for job sizes > 60. I changed the
threshold to 100 and my NP64 jobs run fine. Now to try and understand
what about ompi_coll_tuned_gather_intra_binomial() is causing these
connect delays...
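
For readers unfamiliar with binomial gathers, here is a minimal sketch of why such an algorithm could touch different peers than a linear gather. In a linear gather every rank sends straight to the root; in a binomial tree most ranks send to an intermediate parent instead, so connections to non-root peers are needed. The helpers bmtree_parent() and bmtree_children() are hypothetical names for a textbook binomial tree and are not taken from the Open MPI source; the tree that ompi_coll_tuned_gather_intra_binomial() actually builds may be laid out differently.

```c
#include <stdio.h>

/* Hypothetical helper (not Open MPI code): parent of `rank` in a textbook
 * binomial tree rooted at `root`. Returns -1 for the root itself. */
static int bmtree_parent(int rank, int root, int size)
{
    int vrank = (rank - root + size) % size;  /* rank relative to root */
    if (vrank == 0)
        return -1;
    int pv = vrank & (vrank - 1);             /* clear lowest set bit */
    return (pv + root) % size;
}

/* Hypothetical helper: fills `children` with the direct children of `rank`
 * (at most `max_children` entries) and returns how many were stored. */
static int bmtree_children(int rank, int root, int size,
                           int *children, int max_children)
{
    int vrank = (rank - root + size) % size;
    int n = 0;
    for (int mask = 1; mask < size; mask <<= 1) {
        if (vrank & mask)
            break;                            /* bits above this belong to ancestors */
        int cv = vrank | mask;
        if (cv < size && n < max_children)
            children[n++] = (cv + root) % size;
    }
    return n;
}
```

With size 64 and root 0, this layout gives the root only 6 direct children (ranks 1, 2, 4, 8, 16 and 32), while a rank such as 63 sends to rank 62 rather than to the root. Whether that pattern is what triggers the slow connection setup described in this thread is not established here.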

On 9/16/2010 1:01 PM, Steve Wise wrote:
> Oops. One key typo here: This is the IMB-MPI1 gather test, not
> barrier. :(
> On 9/16/2010 12:05 PM, Steve Wise wrote:
>> Hi,
>> I'm debugging a performance problem with running IMB-MPI1/barrier in
>> an NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1
>> from the OFED-1.5.1 distribution. The BTL is openib/iWARP via
>> Chelsio's T3 RNIC. In short, an NP60 or smaller run completes in a
>> timely manner as expected, but NP61 and larger runs slow to a crawl
>> at the 8KB IO size and take ~5-10min to complete. It does complete
>> though. It behaves this way even if I run on > 8 nodes so there are
>> available cores. I.e., an NP64 run on a 16 node cluster still behaves the
>> same way even though there are only 4 ranks on each node. So it's
>> apparently not a thread starvation issue due to lack of cores. When
>> in the stalled state, I see on the order of 100 or so established
>> iWARP connections on each node. And the connection count increases
>> VERY slowly and sporadically (at its peak there are around 800
>> connections for a NP64 gather operation). In comparison, when I run
>> the <= NP60 runs, the connections quickly ramp up to the expected
>> amount. I added hooks in the openib BTL to track the time it takes
>> to set up each connection. In all runs, both <= NP60 and > NP60, the
>> average connection setup time is around 200ms, and the max setup
>> time seen is never much above this value. That tells me that it's not
>> individual connection setup that is the issue. I then added
>> printfs/fflushes in librdmacm to visually see when a connection is
>> attempted and when it is accepted. When I run with these printfs, I
>> see the connections get set up quickly and evenly in the <= NP60
>> case. Initially when the job is started, I see a small flurry of
>> connections being set up, then the run begins and at around 1KB IO
>> size I see a 2nd large flurry of connection setups. Then the test
>> continues and completes. With the >NP60 case, this second round of
>> connection setups is very sporadic and slow. Very slow! I'll see
>> little bursts of ~10-20 connections being set up, then long random pauses.
>> The net effect is that full connection setup for the job takes 5-10min.
>> During this time the ranks are basically spinning idle, waiting for the
>> connections to be set up. So I'm concluding that something above the
>> BTL layer isn't issuing the endpoint connect requests in a timely
>> manner.
>> Attached are 3 padb dumps during the stall. Anybody see anything
>> interesting in these?
>> Any ideas how I can further debug this? Once I get above the openib
>> BTL layer my eyes glaze over and I get lost quickly. :) I would
>> greatly appreciate any ideas from the OpenMPI experts!
>> Thanks in advance,
>> Steve.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]