I'm debugging a performance problem when running IMB-MPI1/barrier on an
NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from the
OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3
RNIC. In short, NP60 and smaller runs complete in a timely manner, as
expected, but NP61 and larger runs slow to a crawl at the 8KB IO size
and take ~5-10min to complete. They do complete, though. It behaves
this way even if I run on > 8 nodes so that there are spare cores, i.e. an
NP64 run on a 16-node cluster still behaves the same way even though there
are only 4 ranks on each node. So it's apparently not a thread
starvation issue due to a lack of cores. When in the stalled state, I see
on the order of 100 or so established iWARP connections on each node.
And the connection count increases VERY slowly and sporadically (at its
peak there are around 800 connections for an NP64 gather operation). In
comparison, when I run the <= NP60 runs, the connections quickly ramp up
to the expected amount. I added hooks in the openib BTL to track the
time it takes to set up each connection. In all runs, both <= NP60 and >
NP60, the average connection setup time is around 200ms. And the max
setup time seen is never much above this value. That tells me that it's
not individual connection setup that is the issue. I then added
printfs/fflushes in librdmacm to visually see when a connection is
attempted and when it is accepted. When I run with these printfs, I see
the connections get set up quickly and evenly in the <= NP60 case.
Initially, when the job is started, I see a small flurry of connections
getting set up, then the run begins and at around the 1KB IO size I see a
second, large flurry of connection setups. Then the test continues and
completes. In the > NP60 case, this second round of connection setups
is very sporadic and slow. Very slow! I'll see little bursts of ~10-20
connections set up, then long random pauses. The net result is that full
connection setup for the job takes 5-10min. During this time the ranks
are basically spinning idle, waiting for the connections to get set up. So
I'm concluding that something above the BTL layer isn't issuing the
endpoint connect requests in a timely manner.
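To make that reasoning concrete, here is a minimal Python sketch of how the
printf timestamps could be summarized. It assumes each connection has been
reduced to a hypothetical (connect_start, connect_done) pair in seconds. The
function name, the 1-second pause threshold, and the sample data are all
made up for illustration and are not measurements from these runs.

```python
from statistics import mean


def summarize(events, pause_threshold=1.0):
    """Summarize connection setup behaviour.

    events: list of (connect_start, connect_done) times in seconds
            (hypothetical format, e.g. parsed from the printf output).
    pause_threshold: gaps between consecutive connect attempts longer
            than this many seconds are counted as pauses (assumed value).
    """
    events = sorted(events)
    # Per-connection setup latency.
    setup = [done - start for start, done in events]
    # Idle time between one connect attempt and the next.
    starts = [start for start, _ in events]
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    pauses = [g for g in gaps if g > pause_threshold]
    return {
        "connections": len(events),
        "avg_setup": mean(setup),
        "max_setup": max(setup),
        "total_span": max(done for _, done in events) - starts[0],
        "pauses": len(pauses),
        "time_in_pauses": sum(pauses),
    }


# Made-up sample: two bursts of fast connects separated by a long pause.
sample = [(0.0, 0.2), (0.1, 0.3), (0.2, 0.4), (30.2, 30.4), (30.3, 30.5)]
print(summarize(sample))
```

If avg_setup and max_setup stay near 200ms while time_in_pauses accounts for
most of total_span, that would support the conclusion that the connect
requests are being issued late rather than completing slowly.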
Attached are 3 padb dumps taken during the stall. Does anybody see anything
interesting in these?
Any ideas how I can further debug this? Once I get above the openib
BTL layer my eyes glaze over and I get lost quickly. :) I would greatly
appreciate any ideas from the OpenMPI experts!
Thanks in advance,