Retrying w/ fewer CQ entries, as Jeff describes, is a good idea to help
ensure that EINVAL actually does signify that the count exceeds the max
(instead of just assuming this is so). If it actually was signifying
some other error case, then one would probably not want to continue.
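
For what it's worth, a minimal sketch of that retry loop might look like
the following. It is not taken from the openib BTL: the names
resize_fn, resize_cq_with_backoff and fake_resize are hypothetical, and
the halving policy is just one possible choice. The callback stands in
for whatever wraps ibv_resize_cq(), so that the logic can be exercised
without any IB hardware.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: try to resize the CQ to `cqe` entries.
 * Returns 0 on success or an errno value on failure. In real code this
 * would presumably wrap ibv_resize_cq(). */
typedef int (*resize_fn)(void *cq, int cqe);

/* Halve the request for as long as the callback reports EINVAL.
 * - On success, returns 0 and stores the accepted size in *granted.
 * - Any error other than EINVAL is returned immediately, since it is
 *   not the "too many entries" case and retrying makes no sense.
 * - If the request drops below min_useful, returns EINVAL so that the
 *   caller can fail the BTL. */
static int resize_cq_with_backoff(resize_fn try_resize, void *cq,
                                  int wanted, int min_useful, int *granted)
{
    assert(try_resize != NULL && granted != NULL);
    if (min_useful < 1) {
        min_useful = 1;   /* also guarantees that the loop terminates */
    }
    for (int cqe = wanted; cqe >= min_useful; cqe /= 2) {
        int rc = try_resize(cq, cqe);
        if (rc == 0) {
            *granted = cqe;
            return 0;
        }
        if (rc != EINVAL) {
            return rc;
        }
    }
    return EINVAL;
}

/* Stand-in for a device, for demonstration only: `cq` points to an int
 * holding the maximum number of entries the fake device accepts.
 * A NULL `cq` simulates a device that cannot resize at all. */
static int fake_resize(void *cq, int cqe)
{
    if (cq == NULL) {
        return ENOSYS;
    }
    return (cqe <= *(int *)cq) ? 0 : EINVAL;
}
```

With a fake limit of 1000 entries, a request for 4096 is halved three
times and settles at 512. If EINVAL keeps coming back all the way down
to min_useful, that is a decent hint that it means something other than
"too many entries", and the caller gets EINVAL back rather than a
silently undersized CQ.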
Jeff Squyres wrote:
> Thanks for the analysis!
> We've argued about btl_r2_add_btls() before -- IIRC, the consensus is
> that we want it to be able to continue even if a BTL fails. So I
> *think* that your #1 answer is better.
> However, we might want to try a little harder if EINVAL is returned --
> perhaps try decreasing the number of CQ entries and try again until either
> we have too few CQ entries to be useful (e.g., 0 or some higher number
> that is still "too small"), or fail the BTL altogether...?
> On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:
>> Yesterday I had to analyze a SIGSEGV occurring after the following
>> message had been output:
>> [.... adjust_cq] cannot resize completion queue, error: 22
>> What I found is the following:
>> When ibv_resize_cq() fails to resize a CQ (in my case it returned
>> EINVAL), adjust_cq() returns an error and create_srq() is not called by
>> Note: One of our InfiniBand specialists told me that EINVAL was returned
>> in that case because we were asking for more CQ entries than the max
>> mca_bml_r2_add_btls() goes on executing.
>> Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
>> ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
>> (remember that create_srq() has not been previously called).
>> Since all the QPs have been successfully created, qp_create_all() then
>> --> mca_btl_openib_post_srr()
>> --> ibv_post_srq_recv() on a NULL SRQ
>> ==> SIGSEGV
>> If I'm not wrong in the analysis above, we have the choice between 2
>> solutions to fix this problem:
>> 1. If EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
>> as the ENOSYS case: do not return an error, since the CQ has
>> successfully been created, maybe with fewer entries than needed, but it
>> is there.
>> Doing this we assume that EINVAL will always be the symptom of a "too
>> many entries asked for" error from the IB stack. I don't have the
>> + I don't know if this won't imply a degraded mode in terms of
>> 2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
>> FYI I tested solution #1 and it worked...
>> Any suggestion or comment would be welcome.
>> Nadia Derbey <Nadia.Derbey_at_[hidden]>
>> devel mailing list
Paul H. Hargrove PHHargrove_at_[hidden]
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory