
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] openib btl and cq overflows
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-07-03 12:36:13


We talked about this on the weekly call today. Conclusions:

1. Looks like we just goofed on the CQ default size values. Doh!
2. There does not appear to be any reason we're not using the device CQ max size by default. Ticket #3152 changes the trunk to do this (and we'll CMR to v1.6 and v1.7).

On Jul 2, 2012, at 6:05 PM, Steve Wise wrote:

> If I use --mca btl_openib_cq_size and override the computed CQ depth, then I can indeed avoid the CQ overflows.
>
> On 7/2/2012 4:12 PM, Jeff Squyres wrote:
>> You know, I have the following in a few of my MTT configurations:
>>
>> -----
>> # See if this makes the CQ overrun errors go away
>> cq_depth = " --mca btl_openib_cq_size 65536 "
>> -----
>>
>> And then I use that variable as an mpirun CLI option in a few places. It looks like something left over from a long time ago that never got followed up on...
>>
>> So yes, I'm guessing there's some kind of incorrect CQ sizing issue going on. Can someone point Steve to the right place to look in the openib BTL?
>>
>>
>>
>> On Jul 2, 2012, at 11:24 AM, Steve Wise wrote:
>>
>>> Hello,
>>>
>>> I'm debugging an issue with openmpi-1.4.5 and the openib BTL over Chelsio iWARP devices. I am the iWARP driver developer for this device. I have debug code that detects CQ overflows, and I'm seeing RCQ overflows during finalize for certain IMB runs with OMPI: as the recv WRs are flushed, the RCQ for that QP overflows. Note that Chelsio iWARP uses non-shared RQs, and its default .ini is: receive_queues = P,65536,256,192,128
>>>
>>> Here's the job details:
>>>
>>> NP=16; mpirun -np ${NP} --host core96b1,core96b2,core96b3,core96b4 --mca btl openib,sm,self /opt/openmpi-1.4.5/tests/IMB-3.2/IMB-MPI1 -npmin ${NP} alltoall
>>>
>>> The nodes have 4-port iWARP adapters in them, so there are RDMA connections set up over each port. As the alltoall IO size hits 256, we end up with 192 QPs per node, and that seems to be the stable QP count until the test finishes and we see the overflow.
>>>
>>> I added further debug code in my RDMA provider library to track the total depth of all the QPs bound to each CQ, to see whether the application is oversubscribing the CQs. For these jobs, OMPI is in fact oversubscribing the CQs. Here's a snippet of my debug output:
>>>
>>> warning, potential SCQ overflow: total_qp_depth 3120 SCQ depth 1088
>>> warning, potential RCQ overflow: total_qp_depth 3312 RCQ depth 1088
>>> warning, potential SCQ overflow: total_qp_depth 3120 SCQ depth 1088
>>> warning, potential RCQ overflow: total_qp_depth 3312 RCQ depth 1088
>>>
>>> I realize that OMPI may be flow controlling such that the CQ won't overflow even if the total QP depths exceed the CQ depth. But I do see overflows. And a CQ depth of 1088 seems quite small given the size of the SQ or RQ in the above debug output. So it seems that OMPI isn't scaling the CQ depth according to the job.
>>>
>>> As an experiment, I overrode the cq depth by adding '--mca btl_openib_cq_size 16000' to the mpirun line and I don't see the overflow anymore.
>>>
>>> Can all you openib btl experts out there describe the CQ sizing logic and point me to the code that I can dig into to see why we're overflowing the RCQ on finalize operations? Also, does the cq depth of 1088 seem reasonable for this type of work load?
>>>
>>> Thanks in advance!
>>>
>>> Steve.
>>>
>>
>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/