
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] NP64 _gather_ problem
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-17 16:46:37


I'll look into Solaris Studio. I think the connections are somehow being
single-threaded, or otherwise funneled, by the gather algorithm. And since
each connection takes ~160ms to set up, and there are ~3600 connections
being set up, we end up with a 7 minute run time. Now, 160ms seems way too
high for setting up even an iWARP connection, which includes some
streaming-mode TCP exchanges as part of connection setup. I would expect it
to be around a few hundred _usecs_. So I'm pursuing the connect latency too.
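(Back of the envelope, assuming the setups are fully serialized: 3600
connections x ~160ms each is roughly 575 seconds, i.e. on the order of 10
minutes; with some overlap between setups that is in the same ballpark as
the run times we're seeing.)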

Thanks,

Steve.

On 9/17/2010 12:13 PM, Terry Dontje wrote:
> Right, by default all connections are set up on the fly. So when an
> MPI_Send is executed to a process that there is not yet a connection to,
> a handshake happens between the sender and the receiver. Why this only
> happens with np > 60 may have to do with how many connections are being
> set up at the same time, or with the destination of a connection request
> not currently being inside the MPI library.
>
> It would be interesting to figure out where in the timeline of the job
> such requests are being delayed. You can get such a timeline by using a
> tool like the Solaris Studio collector/analyzer (which also has a Linux
> version).
>
> --td
>
> Steve Wise wrote:
>> Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So
>> it's not the algorithm in and of itself, but rather, I guess, some
>> interplay between the algorithm and connection setup.
>>
>>
>> On 9/17/2010 5:24 AM, Terry Dontje wrote:
>>> Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all?
>>> That might help determine whether it is actually the connection setup
>>> between processes that is out of sync, as opposed to something in the
>>> gather algorithm itself.
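>>>
>>> Something along these lines (the benchmark invocation is just an
>>> example) should force all endpoint connections to be wired up during
>>> MPI_Init instead of on first use:
>>>
>>>   mpirun --mca mpi_preconnect_mpi 1 -np 64 ./IMB-MPI1 gather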
>>>
>>> --td
>>>
>>> Steve Wise wrote:
>>>> Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() switches its
>>>> algorithm to a binomial method for job sizes > 60. I changed the
>>>> threshold to 100 and my NP64 jobs run fine. Now to try to understand
>>>> what it is about ompi_coll_tuned_gather_intra_binomial() that is
>>>> causing these connect delays...
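>>>>
>>>> (It should also be possible to run the same experiment without
>>>> patching the source, by forcing the gather algorithm through the
>>>> tuned collective MCA parameters -- I haven't double-checked the
>>>> algorithm numbering on 1.4.1, so treat this as a sketch:
>>>>
>>>>   mpirun --mca coll_tuned_use_dynamic_rules 1 \
>>>>          --mca coll_tuned_gather_algorithm 1 -np 64 ./IMB-MPI1 gather
>>>>
>>>> where, per ompi_info --param coll tuned, 1 should select the basic
>>>> linear gather and 2 the binomial one.)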
>>>>
>>>>
>>>> On 9/16/2010 1:01 PM, Steve Wise wrote:
>>>>> Oops. One key typo here: This is the IMB-MPI1 gather test, not
>>>>> barrier. :(
>>>>>
>>>>>
>>>>> On 9/16/2010 12:05 PM, Steve Wise wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm debugging a performance problem running IMB-MPI1/barrier
>>>>>> on an NP64 cluster (8 nodes, 8 cores each). I'm using
>>>>>> openmpi-1.4.1 from the OFED-1.5.1 distribution. The BTL is
>>>>>> openib/iWARP via Chelsio's T3 RNIC. In short, NP60 and smaller
>>>>>> runs complete in a timely manner as expected, but NP61 and
>>>>>> larger runs slow to a crawl at the 8KB IO size and take
>>>>>> ~5-10min to complete. They do complete, though. It behaves this
>>>>>> way even if I run on > 8 nodes so there are spare cores; i.e.,
>>>>>> an NP64 run on a 16 node cluster still behaves the same way
>>>>>> even though there are only 4 ranks on each node. So it's
>>>>>> apparently not a thread starvation issue due to lack of cores.
>>>>>>
>>>>>> When in the stalled state, I see on the order of 100 or so
>>>>>> established iWARP connections on each node, and the connection
>>>>>> count increases VERY slowly and sporadically (at its peak there
>>>>>> are around 800 connections for an NP64 gather operation). In
>>>>>> comparison, in the <= NP60 runs the connections quickly ramp up
>>>>>> to the expected count.
>>>>>>
>>>>>> I added hooks in the openib BTL to track the time it takes to
>>>>>> set up each connection. In all runs, both <= NP60 and > NP60,
>>>>>> the average connection setup time is around 200ms, and the max
>>>>>> setup time seen is never much above that value. That tells me
>>>>>> that individual connection setup is not the issue.
>>>>>>
>>>>>> I then added printfs/fflushes in librdmacm to see visually when
>>>>>> a connection is attempted and when it is accepted. With these
>>>>>> printfs, I see the connections get set up quickly and evenly in
>>>>>> the <= NP60 case: initially, when the job starts, there is a
>>>>>> small flurry of connection setups, then the run begins, and at
>>>>>> around the 1KB IO size I see a second large flurry of
>>>>>> connection setups. Then the test continues and completes. In
>>>>>> the > NP60 case, this second round of connection setups is very
>>>>>> sporadic and slow. Very slow! I see little bursts of ~10-20
>>>>>> connection setups, then long random pauses. The net is that
>>>>>> full connection setup for the job takes 5-10min, during which
>>>>>> the ranks are basically spinning idle waiting for the
>>>>>> connections to be set up. So I'm concluding that something
>>>>>> above the BTL layer isn't issuing the endpoint connect requests
>>>>>> in a timely manner.
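>>>>>>
>>>>>> For reference, the timing hook in the BTL is nothing fancy --
>>>>>> roughly the following sketch, where on_connect_request() and
>>>>>> on_established() stand in for the points where the connect is
>>>>>> issued and where the established event is handled (the names are
>>>>>> illustrative, not the real BTL symbols):
>>>>>>
>>>>>>   /* Illustrative sketch only, not the actual openib BTL patch. */
>>>>>>   #include <stdio.h>
>>>>>>   #include <time.h>
>>>>>>
>>>>>>   static double now_ms(void)
>>>>>>   {
>>>>>>       struct timespec ts;
>>>>>>       clock_gettime(CLOCK_MONOTONIC, &ts);
>>>>>>       return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
>>>>>>   }
>>>>>>
>>>>>>   static double conn_start_ms;
>>>>>>
>>>>>>   /* called where the connect request is issued */
>>>>>>   static void on_connect_request(void) { conn_start_ms = now_ms(); }
>>>>>>
>>>>>>   /* called where the ESTABLISHED event is handled */
>>>>>>   static void on_established(void)
>>>>>>   {
>>>>>>       printf("connection setup took %.1f ms\n",
>>>>>>              now_ms() - conn_start_ms);
>>>>>>       fflush(stdout);
>>>>>>   }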
>>>>>>
>>>>>> Attached are 3 padb dumps during the stall. Anybody see anything
>>>>>> interesting in these?
>>>>>>
>>>>>> Any ideas how I can further debug this? Once I get above the
>>>>>> openib BTL layer my eyes glaze over and I get lost quickly. :)
>>>>>> I would greatly appreciate any ideas from the OpenMPI experts!
>>>>>>
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Steve.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>> --
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle - Performance Technologies
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.dontje_at_[hidden]
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


