Oops. One key typo here: This is the IMB-MPI1 gather test, not
On 9/16/2010 12:05 PM, Steve Wise wrote:
> I'm debugging a performance problem with running IMB-MPI1/barrier in an
> NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from
> the OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3
> RNIC. In short, an NP60 and smaller run completes in a timely manner
> as expected, but NP61 and larger runs come to a crawl at the 8KB IO
> size and take ~5-10min to complete. It does complete though. It
> behaves this way even if I run on > 8 nodes so there are available
> cores. I.e., an NP64 run on a 16-node cluster still behaves the same way
> even though there are only 4 ranks on each node. So it's apparently not a
> thread starvation issue due to lack of cores. When in the stalled
> state, I see on the order of 100 or so established iwarp connections
> on each node. And the connection count increases VERY slowly and
> sporadically (at its peak there are around 800 connections for an NP64
> gather operation). In comparison, when I run the <= NP60 runs, the
> connections quickly ramp up to the expected amount. I added hooks in
> the openib BTL to track the time it takes to set up each connection.
> In all runs, both <= NP60 and > NP60, the average connection setup
> time is around 200ms. And the max setup time seen is never much above
> this value. That tells me that it's not individual connection setup
> that is the issue. I then added printfs/fflushes in librdmacm to
> visually see when a connection is attempted and when it is accepted.
> When I run with these printfs, I see the connections get set up quickly
> and evenly in the <= NP60 case. Initially when the job is started, I
> see a small flurry of connections getting set up, then the run begins
> and at around 1KB IO size I see a 2nd large flurry of connection
> setups. Then the test continues and completes. With the >NP60 case,
> this second round of connection setups is very sporadic and slow.
> Very slow! I'll see little bursts of ~10-20 connections set up, then
> long random pauses. The net is that full connection setup for the job
> takes 5-10min. During this time the ranks are basically spinning idle
> awaiting the connections to get set up. So I'm concluding that
> something above the BTL layer isn't issuing the endpoint connect
> requests in a timely manner.
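
To make the "fast individual setups, slow overall" observation concrete,
here is a rough sketch of how the printf timestamps could be summarized.
It assumes each connection is logged as a hypothetical (attempt_time,
accept_time) pair in seconds; the function name, the log format, and the
1-second stall threshold are all illustrative assumptions, not anything
from librdmacm or the openib BTL.

```python
from statistics import mean


def summarize_connections(events, gap_threshold=1.0):
    """Summarize per-connection setup time versus idle time between attempts.

    events: non-empty list of (attempt_time, accept_time) pairs in seconds
            (hypothetical log format, e.g. parsed from the added printfs).
    gap_threshold: gaps between consecutive attempts longer than this many
            seconds are counted as stalls (arbitrary illustrative value).
    """
    events = sorted(events)
    setup_times = [accepted - attempted for attempted, accepted in events]
    starts = [attempted for attempted, _ in events]
    # Time between one connection attempt and the next one being issued.
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    stalls = [gap for gap in gaps if gap > gap_threshold]
    return {
        "count": len(events),
        "mean_setup": mean(setup_times),
        "max_setup": max(setup_times),
        "wall_clock": max(accepted for _, accepted in events) - starts[0],
        "stall_count": len(stalls),
        "stall_time": sum(stalls),
    }


# Made-up data: two quick connections, a long pause, then two more.
example = [(0.0, 0.2), (0.1, 0.3), (30.0, 30.2), (30.1, 30.3)]
summary = summarize_connections(example)
```

If the per-connection setup times stay near 200ms while stall_time
accounts for most of wall_clock, that would point at whatever issues the
connect requests rather than at the connection setup path itself.
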
> Attached are 3 padb dumps during the stall. Anybody see anything
> interesting in these?
> Any ideas how I can further debug this? Once I get above the openib
> BTL layer my eyes glaze over and I get lost quickly. :) I would
> greatly appreciate any ideas from the OpenMPI experts!
> Thanks in advance,