Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in the SRQs
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-06-23 10:46:06


Hi Jeff,

Why do we want to set this value so low? Well, just to see if it
crashes :-)

More seriously, we're working on lowering the memory usage of the openib
BTL, and the saving is pushed furthest by having only 1 send queue element
(at very large scale, send queues dominate the memory footprint).

This "extreme" configuration used to work with the 1.3/1.4 branches but
failed on 1.5.

Note that, with recent IB cards having very high issue rates, I don't know
if we often wait for the send queue to drain. More importantly, a short
send queue often prevents the remote receive queue from being filled too
quickly (which avoids RNR NAKs, threads refilling the SRQ, ...). We didn't
notice major performance drops with this configuration.

Sylvain

On Tue, 22 Jun 2010, Jeff Squyres wrote:

> I think your fix looks right.
>
> But I'm getting my head warped trying to understand why you'd want
> numbers so low (4, 2, 1) and exactly what our algorithm will re-post for
> numbers that low, etc. Why do you want them so low?
>
>
> On Jun 18, 2010, at 11:10 AM, nadia.derbey wrote:
>
>> Hi,
>>
>> Reference is the v1.5 branch
>>
>> If an SRQ has the following settings: S,<size>,4,2,1
>>
>> 1) setup_qps() sets the following:
>> mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_num=4
>> mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init=rd_num/4=1
>>
>> 2) create_srq() sets the following:
>> openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1 (rd_init value)
>> openib_btl->qps[qp].u.srq_qp.rd_low_local =
>>     rd_curr_num - (rd_curr_num >> 2) = rd_curr_num = 1
>>
>> 3) if mca_btl_openib_post_srr() is called with rd_posted=1:
>> rd_posted > rd_low_local is false
>> num_post=rd_curr_num-rd_posted=0
>> the loop is not executed
>> wr is never initialized (remains NULL)
>> wr->next: address not mapped
>> ==> SIGSEGV
>>
>> The attached patch solves the problem by ensuring that we only go on
>> when the loop will actually be entered, and return otherwise.
>> Could someone have a look, please? The patch solves the problem with my
>> reproducer, but I'm not sure the fix covers all situations.
>>
>> Regards,
>> Nadia
>>
>> <001_openib_low_rd_num.patch>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
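
The arithmetic in steps 1) to 3) of the quoted report can be checked with a
small standalone sketch. This is not the openib BTL code: struct srq_state
and the three functions are hypothetical stand-ins that only mirror the
field names from the report (rd_num, rd_init, rd_curr_num, rd_low_local,
rd_posted). The guard in srq_post_guarded() is an assumption about the kind
of check the attached patch adds, since the patch itself is not shown in
this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-QP SRQ counters named in the report. */
struct srq_state {
    int rd_num;       /* configured number of receive descriptors */
    int rd_init;      /* initial number, rd_num / 4 in the report */
    int rd_curr_num;  /* current target number of posted receives */
    int rd_low_local; /* low watermark that triggers a re-post */
    int rd_posted;    /* receives currently posted */
};

/* Steps 1) and 2): derive the counters from rd_num as the report describes. */
static void srq_setup(struct srq_state *s, int rd_num)
{
    assert(s != NULL);
    s->rd_num = rd_num;
    s->rd_init = rd_num / 4;
    s->rd_curr_num = s->rd_init;
    /* For rd_curr_num < 4 the shift yields 0, so the watermark equals
     * rd_curr_num itself. */
    s->rd_low_local = s->rd_curr_num - (s->rd_curr_num >> 2);
    s->rd_posted = 0;
}

/* Step 3): number of receives the re-post loop would iterate over.
 * Returns -1 when rd_posted is still above the low watermark. */
static int srq_num_to_post(const struct srq_state *s)
{
    assert(s != NULL);
    if (s->rd_posted > s->rd_low_local) {
        return -1;
    }
    return s->rd_curr_num - s->rd_posted;
}

/* Assumed shape of the fix: leave early when there is nothing to post, so
 * the caller never dereferences a work request that the loop did not set. */
static int srq_post_guarded(struct srq_state *s)
{
    int num_post = srq_num_to_post(s);
    if (num_post <= 0) {
        return 0; /* nothing posted, nothing dereferenced */
    }
    s->rd_posted += num_post;
    return num_post;
}
```

With rd_num = 4, srq_setup() gives rd_init = 1, rd_curr_num = 1 and
rd_low_local = 1. Setting rd_posted = 1 then makes srq_num_to_post() return
0, which is the zero-iteration case from step 3); the guarded variant
returns 0 in that case instead of going on to touch an unset pointer.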