On Mon, May 26, 2014 at 12:09:38PM +0900, Gilles Gouaillardet wrote:
> the assert fails because the endpoint reference count is greater than one.
> the root cause is the endpoint has been added to the list of
> eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
> a simple workaround is not to use eager rdma with the openib btl
> (e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)
> here is attached a patch that solves the issue.
> because of my poor understanding of the openib btl, i did not commit it.
> i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
> (e.g. what happens if there are inflight messages ?)
> i also added some comments that indicate the patch might be suboptimal.
It should be safe, as there should be no in-flight messages at del_procs. If
there are, an error would probably be raised on the sending process.
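
For readers following along, the reference counting at issue can be
sketched roughly as below. This is only an illustration under stated
assumptions: every sketch_* name is a hypothetical stand-in, and none of
it is the actual openib btl code or the attached patch. It shows an
endpoint that is retained when it is added to a device's list of eager
RDMA buffers, and a del_procs-style teardown that drops that extra
reference before checking that only the owner's reference remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted endpoint. */
typedef struct sketch_endpoint {
    int ref_count;
} sketch_endpoint_t;

#define SKETCH_MAX_EAGER 8

/* Hypothetical stand-in for a device holding eager RDMA endpoints. */
typedef struct sketch_device {
    sketch_endpoint_t *eager_rdma_buffers[SKETCH_MAX_EAGER];
    int eager_rdma_count;
} sketch_device_t;

/* Take one reference (plays the role of a retain). */
void sketch_retain(sketch_endpoint_t *ep)
{
    ep->ref_count++;
}

/* Drop one reference and return the number remaining. */
int sketch_release(sketch_endpoint_t *ep)
{
    assert(ep->ref_count > 0);
    return --ep->ref_count;
}

/* Adding the endpoint to the device list takes an extra reference. */
int sketch_eager_rdma_add(sketch_device_t *dev, sketch_endpoint_t *ep)
{
    if (dev->eager_rdma_count >= SKETCH_MAX_EAGER) {
        return -1;
    }
    sketch_retain(ep);
    dev->eager_rdma_buffers[dev->eager_rdma_count++] = ep;
    return 0;
}

/* Remove the endpoint from the list and drop the list's reference. */
void sketch_eager_rdma_remove(sketch_device_t *dev, sketch_endpoint_t *ep)
{
    for (int i = 0; i < dev->eager_rdma_count; i++) {
        if (dev->eager_rdma_buffers[i] == ep) {
            /* Fill the hole with the last entry, then shrink the list. */
            int last = --dev->eager_rdma_count;
            dev->eager_rdma_buffers[i] = dev->eager_rdma_buffers[last];
            dev->eager_rdma_buffers[last] = NULL;
            sketch_release(ep);
            return;
        }
    }
}

/* del_procs-style teardown: returns 0 on success, -1 if someone other
 * than the owner still holds a reference (the case behind the assert). */
int sketch_del_proc(sketch_device_t *dev, sketch_endpoint_t *ep)
{
    sketch_eager_rdma_remove(dev, ep);
    if (ep->ref_count != 1) {
        return -1;
    }
    sketch_release(ep);
    return 0;
}
```

In this sketch, skipping the sketch_eager_rdma_remove() call leaves the
count at two at teardown, which is the situation described in the report.
Whether the real code can drop the reference this simply depends on
there being no in-flight messages, as discussed above.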
> Nathan, could you please review the attached patch ?
Sure. I will take a look. It doesn't surprise me there are these sorts
of issues in del_procs. The functionality has been broken for some time.