
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Still problems with del_procs in trunk
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-05-27 10:45:17


On Mon, May 26, 2014 at 12:09:38PM +0900, Gilles Gouaillardet wrote:
> Rolf,
>
> the assert fails because the endpoint reference count is greater than one.
> the root cause is the endpoint has been added to the list of
> eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
> ompi/mca/btl/openib/btl_openib_endpoint.c:1009)
>
> a simple workaround is not to use eager rdma with the openib btl
> (e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)
>
> attached is a patch that solves the issue.
>
> because of my poor understanding of the openib btl, i did not commit it.
> i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
> (e.g. what happens if there are in-flight messages ?)
> i also added some comments that indicate the patch might be suboptimal.

It should be safe, as there should be no in-flight messages at del_procs.
If there are, an error would probably be raised on the sending process.
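
To make the reference-count problem described above concrete, here is a
toy model in C. It is only a sketch under stated assumptions: it does not
use the real OPAL object system or the openib BTL structures, every
toy_* name is hypothetical, and it does not reproduce the attached patch.
It only shows why an extra retain taken by the device's eager RDMA list
makes a "refcount == 1" check fail at del_procs time, and how dropping
that reference first restores the expected count.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_EAGER 8   /* arbitrary capacity for this sketch */

/* Hypothetical stand-in for a reference-counted endpoint. */
typedef struct {
    int refcount;
} toy_endpoint_t;

/* Hypothetical stand-in for a device that tracks eager RDMA endpoints. */
typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_MAX_EAGER];
    int eager_rdma_count;
} toy_device_t;

static void toy_retain(toy_endpoint_t *ep)  { ep->refcount++; }
static void toy_release(toy_endpoint_t *ep) { ep->refcount--; }

/* Adding the endpoint to the device list takes an extra reference,
 * analogous to the OBJ_RETAIN mentioned in the quoted message. */
static bool toy_enable_eager_rdma(toy_device_t *dev, toy_endpoint_t *ep)
{
    if (dev->eager_rdma_count >= TOY_MAX_EAGER) {
        return false;
    }
    toy_retain(ep);
    dev->eager_rdma_buffers[dev->eager_rdma_count] = ep;
    dev->eager_rdma_count++;
    return true;
}

/* Remove the endpoint from the device list and drop the list's reference. */
static void toy_remove_from_eager_list(toy_device_t *dev, toy_endpoint_t *ep)
{
    for (int i = 0; i < dev->eager_rdma_count; i++) {
        if (dev->eager_rdma_buffers[i] == ep) {
            int last = dev->eager_rdma_count - 1;
            dev->eager_rdma_buffers[i] = dev->eager_rdma_buffers[last];
            dev->eager_rdma_buffers[last] = NULL;
            dev->eager_rdma_count = last;
            toy_release(ep);
            return;
        }
    }
}

/* del_procs-style check: true only if the caller holds the last reference. */
static bool toy_del_proc_ok(const toy_endpoint_t *ep)
{
    return ep->refcount == 1;
}
```

With an endpoint that starts at refcount 1, toy_del_proc_ok() is true,
becomes false after toy_enable_eager_rdma(), and is true again once
toy_remove_from_eager_list() has run. Whether the real fix should release
the endpoint at that point or elsewhere is exactly the open question in
the quoted message.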

> Nathan, could you please review the attached patch ?

Sure. I will take a look. It doesn't surprise me there are these sorts
of issues in del_procs. The functionality has been broken for some time.

-Nathan


