Thanks for the analysis!
We've argued about mca_bml_r2_add_btls() before -- IIRC, the consensus is
that we want it to be able to continue even if a BTL fails. So I
*think* that your #1 answer is better.
However, we might want to try a little harder if EINVAL is returned --
perhaps decrease the number of CQ entries and try again until either the
resize succeeds or we have too few CQ entries to be useful (e.g., 0 or
some higher number that is still "too small"), at which point we fail
the BTL altogether...?
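
To make the "try a little harder" idea concrete, here is a rough sketch of
the retry loop. It is not openib BTL code: resize_fn is a hypothetical
callback standing in for ibv_resize_cq() (which returns 0 on success or an
errno value on failure), and MIN_USEFUL_CQE, the halving step, and
fake_resize() are assumptions made up for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback standing in for ibv_resize_cq(cq, cqe):
 * returns 0 on success or an errno value on failure. */
typedef int (*resize_fn)(void *cq, int cqe);

/* Smallest CQ size still considered useful. This is an assumed value
 * for the sketch, not an Open MPI constant. */
#define MIN_USEFUL_CQE 64

/*
 * Ask for `wanted` entries. On EINVAL, halve the request and retry
 * until the resize succeeds or the request drops below MIN_USEFUL_CQE.
 * Returns the size obtained (> 0), or a negative errno value if the
 * caller should fail the BTL.
 */
static int resize_cq_with_backoff(void *cq, int wanted, resize_fn resize)
{
    int cqe = wanted;

    while (cqe >= MIN_USEFUL_CQE) {
        int rc = resize(cq, cqe);
        if (rc == 0) {
            return cqe;      /* resized, possibly smaller than wanted */
        }
        if (rc != EINVAL) {
            return -rc;      /* some other error: do not retry */
        }
        cqe /= 2;            /* assume "too many entries": ask for fewer */
    }
    return -EINVAL;          /* too few entries left to be useful */
}

/* Stand-in device for illustration: rejects requests above a limit. */
static int fake_max_cqe = 1024;

static int fake_resize(void *cq, int cqe)
{
    (void)cq;
    return (cqe > fake_max_cqe) ? EINVAL : 0;
}
```

With the stand-in above, a request for 4096 entries is tried at 4096 and
2048 before succeeding at 1024. Halving is only one possible policy;
querying the device maximum up front would avoid the retries entirely.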
On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:
> Yesterday I had to analyze a SIGSEGV occurring after the following
> message had been output:
> [.... adjust_cq] cannot resize completion queue, error: 22
> What I found is the following:
> When ibv_resize_cq() fails to resize a CQ (in my case it returned
> EINVAL), adjust_cq() returns an error and create_srq() is not called.
> Note: One of our InfiniBand specialists told me that EINVAL was returned
> in that case because we were asking for more CQ entries than the max
> mca_bml_r2_add_btls() goes on executing.
> Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
> ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
> (remember that create_srq() has not been previously called).
> Since all the QPs have been successfully created, qp_create_all() then
> --> mca_btl_openib_post_srr()
> --> ibv_post_srq_recv() on a NULL SRQ
> ==> SIGSEGV
> If I'm not wrong in the analysis above, we have the choice between 2
> solutions to fix this problem:
> 1. if EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
> as the ENOSYS case: do not return an error, since the CQ has been
> successfully created, maybe with fewer entries than needed, but it
> is there.
> Doing this we assume that EINVAL will always be the symptom of a "too
> many entries asked for" error from the IB stack. I don't have the
> + I don't know if this won't imply a degraded mode in terms of
> 2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
> FYI I tested solution #1 and it worked...
> Any suggestion or comment would be welcome.
> Nadia Derbey <Nadia.Derbey_at_[hidden]>