
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] NP64 _gather_ problem
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-20 13:12:23


Just an update for folks: the connection setup latency was a bug in my
iw_cxgb3 RDMA driver. It wasn't turning off RX coalescing for the iWARP
connections, which added 100-200ms of latency, since iWARP connection
setup uses TCP streaming-mode messages to negotiate the iWARP connection
mode. Once I fixed this, the gather operation for > NP60 behaves much,
much better...
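
In case it helps anyone chasing something similar: a rough sketch (not
the hooks I actually used) of timestamping rdma_connect()/rdma_accept()
with an LD_PRELOAD shim instead of patching librdmacm. Note that
rdma_connect() returns before the connection is established, so the real
setup time ends at the RDMA_CM_EVENT_ESTABLISHED event; this only logs
when the calls are issued.

    /* connlog.c -- timestamp rdma_connect()/rdma_accept() calls.
     * Build: gcc -shared -fPIC connlog.c -o connlog.so -ldl -lrt
     * Run the job with LD_PRELOAD pointing at connlog.so. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <rdma/rdma_cma.h>

    typedef int (*cm_call)(struct rdma_cm_id *, struct rdma_conn_param *);

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *param)
    {
        static cm_call real;
        if (!real)
            real = (cm_call)dlsym(RTLD_NEXT, "rdma_connect");
        fprintf(stderr, "[%d] rdma_connect issued at %.6f\n",
                (int)getpid(), now());
        return real(id, param);
    }

    int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *param)
    {
        static cm_call real;
        if (!real)
            real = (cm_call)dlsym(RTLD_NEXT, "rdma_accept");
        fprintf(stderr, "[%d] rdma_accept issued at %.6f\n",
                (int)getpid(), now());
        return real(id, param);
    }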

Thanks, Terry, for helping.

Steve.

On 09/17/2010 03:46 PM, Steve Wise wrote:
> I'll look into Solaris Studio. I think the connections are somehow
> getting single-threaded or funneled due to the gather algorithm. And
> since they take ~160ms to set up each, and there are ~3600 connections
> getting set up, we end up with a 7-minute run time. Now, 160ms seems
> way too high for setting up even an iWARP connection, which has some
> streaming-mode TCP exchanges as part of connection setup. I would
> think it should be around a few hundred _usecs_. So I'm pursuing the
> connect latency too.
>
> Thanks,
>
> Steve.
>
> On 9/17/2010 12:13 PM, Terry Dontje wrote:
>> Right, by default all connections are handled on the fly. So when an
>> MPI_Send is executed to a process to which there is not yet a
>> connection, a dance happens between the sender and the receiver. Why
>> this shows up with np > 60 may have to do with how many connections
>> are being set up at the same time, or with whether the destination of
>> a connection request is currently inside the MPI library (and so able
>> to progress the handshake).
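>>
>> A quick way to see that cost directly (just a toy, not anything from
>> the Open MPI code base): from rank 0, time the first and second round
>> trip to every peer. With on-the-fly connections the first ping absorbs
>> the connection dance and the second one does not.
>>
>>   #include <mpi.h>
>>   #include <stdio.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       int rank, size, peer, attempt;
>>       char byte = 0;
>>
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>       if (rank == 0) {
>>           for (peer = 1; peer < size; peer++) {
>>               for (attempt = 0; attempt < 2; attempt++) {
>>                   double t0 = MPI_Wtime();
>>                   MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
>>                   MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
>>                            MPI_STATUS_IGNORE);
>>                   printf("peer %d ping %d: %.3f ms\n", peer, attempt,
>>                          (MPI_Wtime() - t0) * 1e3);
>>               }
>>           }
>>       } else {
>>           for (attempt = 0; attempt < 2; attempt++) {
>>               MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
>>                        MPI_STATUS_IGNORE);
>>               MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>>           }
>>       }
>>
>>       MPI_Finalize();
>>       return 0;
>>   }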
>>
>> It would be interesting to figure out where in the timeline of the
>> job such requests are being delayed. You can get such a timeline by
>> using a tool like the Solaris Studio collector/analyzer (which
>> actually has a Linux version).
>>
>> --td
>>
>> Steve Wise wrote:
>>> Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall.
>>> So it's not the algorithm in and of itself, but rather some interplay
>>> between the algorithm and connection setup, I guess.
>>>
>>>
>>> On 9/17/2010 5:24 AM, Terry Dontje wrote:
>>>> Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all?
>>>> That might help determine whether it is the actual connection setup
>>>> between processes being out of sync, as opposed to something in the
>>>> gather algorithm itself.
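>>>>
>>>> Roughly, that preconnect just means every rank touches every other
>>>> rank with a tiny message before the real work starts, so all the
>>>> connections are already up when the collective runs. A sketch of
>>>> the idea (not the actual implementation) using a shifted ring so
>>>> that each step's sends and receives pair up cleanly:
>>>>
>>>>   #include <mpi.h>
>>>>
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>       int rank, size, d;
>>>>       char sb = 0, rb;
>>>>
>>>>       MPI_Init(&argc, &argv);
>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>
>>>>       /* step d: send to rank+d, receive from rank-d; over all the
>>>>        * steps every rank exchanges one message with every peer */
>>>>       for (d = 1; d < size; d++) {
>>>>           int to   = (rank + d) % size;
>>>>           int from = (rank - d + size) % size;
>>>>           MPI_Sendrecv(&sb, 1, MPI_CHAR, to, 0,
>>>>                        &rb, 1, MPI_CHAR, from, 0,
>>>>                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>       }
>>>>
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>   }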
>>>>
>>>> --td
>>>>
>>>> Steve Wise wrote:
>>>>> Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes
>>>>> its algorithm to a binomial method for job sizes > 60. I changed
>>>>> the threshold to 100 and my NP64 jobs run fine. Now to try and
>>>>> understand what it is about ompi_coll_tuned_gather_intra_binomial()
>>>>> that is causing these connect delays...
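>>>>>
>>>>> For what it's worth, the binomial tree changes which pairs of ranks
>>>>> ever need a connection for the gather: instead of everyone talking
>>>>> only to the root, each rank talks to up to log2(NP) partners. A
>>>>> little standalone sketch of the classic pattern (the tuned
>>>>> implementation may differ in details, e.g. for non-zero roots):
>>>>>
>>>>>   #include <stdio.h>
>>>>>   #include <stdlib.h>
>>>>>
>>>>>   int main(int argc, char **argv)
>>>>>   {
>>>>>       int size = (argc > 1) ? atoi(argv[1]) : 64;  /* e.g. NP64 */
>>>>>       int rank, mask;
>>>>>
>>>>>       for (rank = 0; rank < size; rank++) {
>>>>>           printf("rank %2d:", rank);
>>>>>           for (mask = 1; mask < size; mask <<= 1) {
>>>>>               if (rank & mask) {
>>>>>                   /* done collecting: pass everything to the parent */
>>>>>                   printf("  sends to %d", rank & ~mask);
>>>>>                   break;
>>>>>               } else if ((rank | mask) < size) {
>>>>>                   /* collect from this child before moving up */
>>>>>                   printf("  recvs from %d", rank | mask);
>>>>>               }
>>>>>           }
>>>>>           printf("\n");
>>>>>       }
>>>>>       return 0;
>>>>>   }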
>>>>>
>>>>>
>>>>> On 9/16/2010 1:01 PM, Steve Wise wrote:
>>>>>> Oops. One key typo here: This is the IMB-MPI1 gather test, not
>>>>>> barrier. :(
>>>>>>
>>>>>>
>>>>>> On 9/16/2010 12:05 PM, Steve Wise wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm debugging a performance problem with running IMB-MPI1/barrier
>>>>>>> on an NP64 cluster (8 nodes, 8 cores each). I'm using
>>>>>>> openmpi-1.4.1 from the OFED-1.5.1 distribution. The BTL is
>>>>>>> openib/iWARP via Chelsio's T3 RNIC. In short, NP60 and smaller
>>>>>>> runs complete in a timely manner as expected, but NP61 and larger
>>>>>>> runs slow to a crawl at the 8KB IO size and take ~5-10 min to
>>>>>>> complete. They do complete, though. It behaves this way even if I
>>>>>>> run on > 8 nodes so there are available cores; i.e., an NP64 run
>>>>>>> on a 16-node cluster still behaves the same way even though there
>>>>>>> are only 4 ranks on each node. So it's apparently not a thread
>>>>>>> starvation issue due to lack of cores.
>>>>>>>
>>>>>>> When in the stalled state, I see on the order of 100 or so
>>>>>>> established iWARP connections on each node, and the connection
>>>>>>> count increases VERY slowly and sporadically (at its peak there
>>>>>>> are around 800 connections for an NP64 gather operation). In
>>>>>>> comparison, when I run the <= NP60 cases, the connections quickly
>>>>>>> ramp up to the expected amount.
>>>>>>>
>>>>>>> I added hooks in the openib BTL to track the time it takes to set
>>>>>>> up each connection. In all runs, both <= NP60 and > NP60, the
>>>>>>> average connection setup time is around 200ms, and the max setup
>>>>>>> time seen is never much above that value. That tells me that it's
>>>>>>> not individual connection setup that is the issue.
>>>>>>>
>>>>>>> I then added printfs/fflushes in librdmacm to visually see when a
>>>>>>> connection is attempted and when it is accepted. With these
>>>>>>> printfs, I see the connections get set up quickly and evenly in
>>>>>>> the <= NP60 case: initially when the job is started, I see a small
>>>>>>> flurry of connections getting set up, then the run begins, and at
>>>>>>> around the 1KB IO size I see a 2nd large flurry of connection
>>>>>>> setups. Then the test continues and completes. With the > NP60
>>>>>>> case, this second round of connection setups is very sporadic and
>>>>>>> slow. Very slow! I'll see little bursts of ~10-20 connections get
>>>>>>> set up, then long random pauses. The net is that full connection
>>>>>>> setup for the job takes 5-10 min, and during this time the ranks
>>>>>>> are basically spinning idle awaiting the connections to get set
>>>>>>> up. So I'm concluding that something above the BTL layer isn't
>>>>>>> issuing the endpoint connect requests in a timely manner.
>>>>>>>
>>>>>>> Attached are 3 padb dumps during the stall. Anybody see
>>>>>>> anything interesting in these?
>>>>>>>
>>>>>>> Any ideas how I can further debug this? Once I get above the
>>>>>>> openib BTL layer my eyes glaze over and I get lost quickly. :)
>>>>>>> I would greatly appreciate any ideas from the OpenMPI experts!
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> Steve.
>>>>>>>
>>
>>
>> --
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email: terry.dontje_at_[hidden]