It looks like we are touching a QP that was already released. Before closing the QP we make sure to complete all outstanding messages on the endpoint. Once all QPs (and other resources) are closed, we signal the async thread to remove this HCA from the monitoring list. To me it looks like we somehow close the QP before all outstanding requests have completed.
Pavel Shamis (Pasha)
On Jan 3, 2011, at 12:44 PM, Jeff Squyres (jsquyres) wrote:
> I'd guess the same thing as George - a race condition in the shutdown of the async thread...? I haven't looked at that code in a long time, so I don't remember how it tried to defend against the race condition.
> Sent from my PDA. No type good.
> On Jan 3, 2011, at 2:31 PM, "Eugene Loh" <eugene.loh_at_[hidden]> wrote:
>> George Bosilca wrote:
>>> This error indicates that somehow we're accessing the QP while the QP is in the "down" state. Since the asynchronous thread is the one that sees this error, I wonder if it is looking for information about a QP that has already been destroyed by the main thread (as this only occurs in MPI_Finalize).
>>> Can you look in the syslog to see if there is any additional info related to this issue there?
>> Not much. A one-liner like this:
>> Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE local access violation
>>> On Dec 30, 2010, at 20:43, Eugene Loh <eugene.loh_at_[hidden]> wrote:
>>>> I was running a bunch of np=4 test programs over two nodes. Occasionally, *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize(). I traced the code and ran another program that mimicked the particular MPI calls made by that program. This other program, too, would occasionally trigger this error. I never saw the problem with other tests. The rate of incidence could range from consecutive runs (I saw this once) to 1:100s (more typically) to even less frequently -- I've had 1000s of consecutive runs with no problems. (The tests run a few seconds apiece.) The traffic pattern is sends from non-zero ranks to rank 0, with root-0 gathers, and lots of Allgathers. The largest messages are 1000 bytes. It appears the problem is always seen on rank 3.
>>>> Now, I wouldn't mind someone telling me, based on that little information, what the problem is here, but I guess I don't expect that. What I am asking is what IBV_EVENT_QP_ACCESS_ERR means. Again, it's seen during MPI_Finalize. The async thread is seeing this. What is this error trying to tell me?
>> devel mailing list