Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2011-01-03 14:44:17


I'd guess the same thing as George - a race condition in the shutdown of the async thread...? I haven't looked at that code in a long, long time, so I don't remember how it tried to defend against the race condition.
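
For illustration only, here is a minimal Python sketch of the shutdown ordering that would avoid this kind of race, assuming the cause really is the async thread touching a QP that the main thread has already destroyed. It is not Open MPI's code and does not use the verbs API: FakeQP, async_event_loop, finalize, and run_demo are all hypothetical names made up for the sketch.

```python
import queue
import threading


class FakeQP:
    """Hypothetical stand-in for a queue pair; not a real verbs object."""

    def __init__(self):
        self.alive = True

    def query(self):
        # Touching a destroyed QP is the failure we want to rule out.
        if not self.alive:
            raise RuntimeError("access to destroyed QP")
        return "ok"

    def destroy(self):
        self.alive = False


_STOP = object()  # sentinel telling the async thread to exit


def async_event_loop(events, qp, errors):
    """Hypothetical async thread: looks at the QP for every event it gets."""
    while True:
        ev = events.get()
        if ev is _STOP:
            return
        try:
            qp.query()
        except RuntimeError as exc:
            errors.append(str(exc))


def finalize(events, thread, qp):
    """Tear down in an order that leaves no window for the race."""
    events.put(_STOP)  # 1. ask the async thread to exit
    thread.join()      # 2. wait until it has really stopped
    qp.destroy()       # 3. only now destroy what it was using


def run_demo(n_events=100):
    events = queue.Queue()
    qp = FakeQP()
    errors = []
    t = threading.Thread(target=async_event_loop, args=(events, qp, errors))
    t.start()
    for i in range(n_events):
        events.put(i)
    finalize(events, t, qp)
    return errors, qp.alive, t.is_alive()
```

Calling run_demo() returns the errors the async thread recorded, whether the QP is still alive, and whether the thread is still running. If finalize() destroyed the QP before joining the thread, the thread could still be working through queued events and would hit the destroyed QP, which is the kind of window suspected here.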

Sent from my PDA. No type good.

On Jan 3, 2011, at 2:31 PM, "Eugene Loh" <eugene.loh_at_[hidden]> wrote:

> George Bosilca wrote:
>
>> Eugene,
>>
>> This error indicates that somehow we're accessing the QP while the QP is in the "down" state. As the asynchronous thread is the one that sees this error, I wonder if it isn't looking for some information about a QP that has been destroyed by the main thread (as this only occurs in MPI_Finalize).
>>
>> Can you look in the syslog to see if there is any additional info related to this issue there?
>>
> Not much. A one-liner like this:
>
> Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE local access violation
>
>> On Dec 30, 2010, at 20:43, Eugene Loh <eugene.loh_at_[hidden]> wrote:
>>
>>> I was running a bunch of np=4 test programs over two nodes. Occasionally, *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize(). I traced the code and ran another program that mimicked the particular MPI calls made by that program. This other program, too, would occasionally trigger this error. I never saw the problem with other tests. Rate of incidence could go from consecutive runs (I saw this once) to 1:100s (more typically) to even less frequently -- I've had 1000s of consecutive runs with no problems. (The tests run a few seconds apiece.) The traffic pattern is sends from non-zero ranks to rank 0, with root-0 gathers, and lots of Allgathers. The largest messages are 1000 bytes. It appears the problem is always seen on rank 3.
>>>
>>> Now, I wouldn't mind someone telling me, based on that little information, what the problem is here, but I guess I don't expect that. What I am asking is what IBV_EVENT_QP_ACCESS_ERR means. Again, it's seen during MPI_Finalize. The async thread is seeing this. What is this error trying to tell me?
>>>