Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] mca_btl_openib_post_srr() posts to an uncreated SRQ when ibv_resize_cq() has failed
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-10-26 17:50:58


Thanks for the analysis!

We've argued about mca_bml_r2_add_btls() before -- IIRC, the consensus is
that we want it to be able to continue even if a BTL fails. So I
*think* that your #1 answer is better.

However, we might want to try a little harder if EINVAL is returned --
perhaps decrease the number of CQ entries and try again until either
the resize succeeds or we have too few CQ entries to be useful (e.g.,
0 or some higher number that is still "too small"), at which point we
fail the BTL altogether...?
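
To make the retry idea concrete, here is a rough sketch in C. It is not
openib BTL code: resize_fn is a hypothetical stand-in for ibv_resize_cq()
(which returns 0 on success or an errno value on failure), MIN_USEFUL_CQE
is an arbitrary floor I made up, halving is just one possible back-off
policy, and fake_resize() is a test double that rejects any request above
a pretend device maximum.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for ibv_resize_cq(): returns 0 on success or an
 * errno value on failure. */
typedef int (*resize_fn)(void *cq, int cqe);

/* Hypothetical floor: below this the CQ is considered "too small". */
#define MIN_USEFUL_CQE 64

/* Ask for `wanted` entries; on EINVAL halve the request and try again.
 * Returns the number of entries obtained, or -1 if the caller should
 * fail the BTL (non-EINVAL error, or the request shrank below the floor). */
static int resize_cq_with_backoff(resize_fn resize, void *cq, int wanted)
{
    assert(resize != NULL);

    for (int cqe = wanted; cqe >= MIN_USEFUL_CQE; cqe /= 2) {
        int rc = resize(cq, cqe);
        if (rc == 0) {
            return cqe;   /* resize accepted at this size */
        }
        if (rc != EINVAL) {
            return -1;    /* some other failure: do not keep retrying */
        }
    }
    return -1;            /* ran out of useful sizes */
}

/* Test double: treats `cq` as a pointer to the pretend device maximum and
 * rejects anything larger with EINVAL. */
static int fake_resize(void *cq, int cqe)
{
    int max_cqe = *(const int *)cq;
    return (cqe > max_cqe) ? EINVAL : 0;
}
```

Whether halving is the right step, and what the floor should be, would
depend on how many CQ entries the BTL actually needs to make progress.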

On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:

> Hi,
>
> Yesterday I had to analyze a SIGSEGV occurring after the following
> message had been output:
> [.... adjust_cq] cannot resize completion queue, error: 22
>
>
> What I found is the following:
>
> When ibv_resize_cq() fails to resize a CQ (in my case it returned
> EINVAL), adjust_cq() returns an error and create_srq() is not called
> by mca_btl_openib_size_queues().
>
> Note: One of our InfiniBand specialists told me that EINVAL was
> returned in that case because we were asking for more CQ entries
> than the maximum available.
>
> mca_bml_r2_add_btls() goes on executing.
>
> Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
> ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
> (remember that create_srq() has not been previously called).
>
> Since all the QPs have been successfully created, qp_create_all() then
> calls:
> mca_btl_openib_endpoint_post_recvs()
> --> mca_btl_openib_post_srr()
> --> ibv_post_srq_recv() on a NULL SRQ
> ==> SIGSEGV
>
>
> If I'm not wrong in the analysis above, we have the choice between 2
> solutions to fix this problem:
>
> 1. If EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
> as the ENOSYS case: do not return an error, since the CQ has been
> successfully created, maybe with fewer entries than needed, but it
> is there.
>
> Doing this we assume that EINVAL will always be the symptom of a "too
> many entries asked for" error from the IB stack. I don't have the
> answer...
> Also, I don't know whether this would imply a degraded mode in terms
> of performance.
>
> 2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
> btl_add_procs().
>
> FYI I tested solution #1 and it worked...
>
> Any suggestion or comment would be welcome.
>
> Regards,
> Nadia
>
> --
> Nadia Derbey <Nadia.Derbey_at_[hidden]>
>
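
For reference, solution #1 from the quoted message, plus a defensive check
against the NULL SRQ, could look roughly like the sketch below. This is an
illustration and not the actual adjust_cq() or mca_btl_openib_post_srr()
code: the enum, classify_resize_rc() and can_post_to_srq() are
hypothetical names, and treating EINVAL as "too many entries requested"
is exactly the assumption questioned above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcome codes for this sketch (not Open MPI's). */
enum resize_outcome {
    RESIZE_OK,           /* CQ was resized */
    RESIZE_KEEP_OLD_CQ,  /* resize refused, keep using the existing CQ */
    RESIZE_FATAL         /* real failure: report an error to the caller */
};

/* Map the return value of a resize attempt (0 or an errno value, as
 * ibv_resize_cq() reports it) to what the caller should do next. */
static enum resize_outcome classify_resize_rc(int rc)
{
    assert(rc >= 0);     /* expect 0 or a positive errno value */

    switch (rc) {
    case 0:
        return RESIZE_OK;
    case ENOSYS:         /* resize not supported by the device */
    case EINVAL:         /* ASSUMPTION: more entries requested than the max */
        return RESIZE_KEEP_OLD_CQ;
    default:
        return RESIZE_FATAL;
    }
}

/* Defensive guard: never post receives to an SRQ that was not created. */
static int can_post_to_srq(const void *srq)
{
    return srq != NULL;
}
```

With the guard in place, a failed setup would surface as an error on the
post path instead of a SIGSEGV, whichever of the two solutions is chosen.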

-- 
Jeff Squyres
jsquyres_at_[hidden]