
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] NP64 _gather_ problem
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-17 11:41:51


Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So
it's not the algorithm in and of itself, but rather some interplay
between the algorithm and connection setup, I guess.
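
Something along these lines forces the preconnect for the whole job
(the binary path, hostfile name, and BTL list below are just
placeholders for whatever your setup uses):

    mpirun --mca mpi_preconnect_mpi 1 --mca btl openib,sm,self \
           -np 64 -hostfile ./hosts ./IMB-MPI1 gather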

On 9/17/2010 5:24 AM, Terry Dontje wrote:
> Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all?
> That might help determine whether it is the actual connection setup
> between processes that is out of sync, as opposed to something in
> the gather algorithm itself.
>
> --td
>
> Steve Wise wrote:
>> Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its
>> algorithm for job sizes > 60 to a binomial method. I changed the
>> threshold to 100 and my NP64 jobs run fine. Now to try to
>> understand what about ompi_coll_tuned_gather_intra_binomial() is
>> causing these connect delays...
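>>
>> For reference, a rough way to compare algorithms without patching
>> the threshold is the tuned module's forced-algorithm MCA knobs
>> (parameter names and values here are from memory; ompi_info --param
>> coll tuned should show the authoritative list):
>>
>>    mpirun --mca coll_tuned_use_dynamic_rules 1 \
>>           --mca coll_tuned_gather_algorithm 1 \
>>           -np 64 ./IMB-MPI1 gather
>>
>> where algorithm 1 should be the basic linear gather rather than the
>> binomial one.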
>>
>>
>> On 9/16/2010 1:01 PM, Steve Wise wrote:
>>> Oops. One key typo here: This is the IMB-MPI1 gather test, not
>>> barrier. :(
>>>
>>>
>>> On 9/16/2010 12:05 PM, Steve Wise wrote:
>>>> Hi,
>>>>
>>>> I'm debugging a performance problem running IMB-MPI1/barrier on an
>>>> NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from
>>>> the OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's
>>>> T3 RNIC. In short, NP60 and smaller runs complete in a timely
>>>> manner as expected, but NP61 and larger runs slow to a crawl at
>>>> the 8KB IO size and take ~5-10 minutes to complete. They do
>>>> complete, though. The behavior is the same even if I run on more
>>>> than 8 nodes so that there are spare cores: an NP64 job on a
>>>> 16-node cluster still stalls even though there are only 4 ranks
>>>> per node, so it's apparently not a thread starvation issue due to
>>>> lack of cores.
>>>>
>>>> In the stalled state I see on the order of 100 established iWARP
>>>> connections on each node, and the connection count increases VERY
>>>> slowly and sporadically (at its peak there are around 800
>>>> connections for an NP64 gather operation). In comparison, in the
>>>> <= NP60 runs the connections quickly ramp up to the expected
>>>> count.
>>>>
>>>> I added hooks in the openib BTL to track the time it takes to set
>>>> up each connection. In all runs, both <= NP60 and > NP60, the
>>>> average connection setup time is around 200ms, and the maximum
>>>> setup time seen is never much above that. That tells me individual
>>>> connection setup is not the issue.
>>>>
>>>> I then added printfs/fflushes in librdmacm to see when a
>>>> connection is attempted and when it is accepted. With these
>>>> printfs, in the <= NP60 case the connections get set up quickly
>>>> and evenly: when the job starts I see a small flurry of connection
>>>> setups, then the run begins, and at around the 1KB IO size I see a
>>>> second large flurry of setups, after which the test continues and
>>>> completes. In the > NP60 case, this second round of connection
>>>> setups is very sporadic and slow. Very slow! I see little bursts
>>>> of ~10-20 connections being set up, then long random pauses. The
>>>> net effect is that full connection setup for the job takes 5-10
>>>> minutes, and during this time the ranks are basically spinning
>>>> idle waiting for the connections to be established.
>>>>
>>>> So I'm concluding that something above the BTL layer isn't issuing
>>>> the endpoint connect requests in a timely manner.
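>>>>
>>>> (The timing hook is conceptually just a monotonic timestamp taken
>>>> when the connect is issued and again when the ESTABLISHED event
>>>> arrives, roughly like the sketch below; this is only an
>>>> illustration, not the actual BTL code.)
>>>>
>>>> #include <stdio.h>
>>>> #include <time.h>
>>>>
>>>> /* millisecond timestamp helper */
>>>> static double now_ms(void)
>>>> {
>>>>     struct timespec ts;
>>>>     clock_gettime(CLOCK_MONOTONIC, &ts);
>>>>     return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
>>>> }
>>>>
>>>> /* t0 = now_ms() just before rdma_connect();
>>>>    t1 = now_ms() when RDMA_CM_EVENT_ESTABLISHED is reported;
>>>>    printf("conn setup took %.1f ms\n", t1 - t0); fflush(stdout); */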
>>>>
>>>> Attached are 3 padb dumps during the stall. Anybody see anything
>>>> interesting in these?
>>>>
>>>> Any ideas on how I can debug this further? Once I get above the
>>>> openib BTL layer my eyes glaze over and I get lost quickly. :) I
>>>> would greatly appreciate any ideas from the Open MPI experts!
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> Steve.
>>>>
>>>>
>>>
>>>
>>
>>
>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


