This error indicates that something is accessing the QP while it is in the "down" state. Since the asynchronous thread is the one that sees this error, I wonder whether it is looking up information about a QP that the main thread has already destroyed (as this only occurs in MPI_Finalize).
Can you look in the syslog to see if there is any additional info related to this issue there?
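To make the suspected race concrete, here is a minimal single-threaded sketch in C. It is only a model of the hypothesis above, not Open MPI or libibverbs code: `qp_table`, `qp_table_add`, `qp_table_remove`, and `handle_qp_event` are all hypothetical names, and the locking that a real two-thread implementation would need around the table is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: none of these names come from Open MPI or libibverbs. */
#define MAX_QPS 8

typedef enum { EV_HANDLED, EV_STALE_QP } event_result;

typedef struct {
    uint32_t qp_num;
    bool     live;   /* false once the main thread has destroyed the QP */
} qp_slot;

typedef struct {
    qp_slot slots[MAX_QPS];
} qp_table;

/* Register a QP in the first free slot; returns false if the table is full. */
bool qp_table_add(qp_table *t, uint32_t qp_num) {
    assert(t != NULL);
    for (size_t i = 0; i < MAX_QPS; i++) {
        if (!t->slots[i].live) {
            t->slots[i].qp_num = qp_num;
            t->slots[i].live = true;
            return true;
        }
    }
    return false;
}

/* What the main thread would do during finalize: mark the QP as gone. */
void qp_table_remove(qp_table *t, uint32_t qp_num) {
    assert(t != NULL);
    for (size_t i = 0; i < MAX_QPS; i++) {
        if (t->slots[i].live && t->slots[i].qp_num == qp_num) {
            t->slots[i].live = false;
        }
    }
}

/* What the async thread would do on a QP event: look the QP up first and
 * treat a miss as "already torn down" instead of touching a dead QP. */
event_result handle_qp_event(const qp_table *t, uint32_t qp_num) {
    assert(t != NULL);
    for (size_t i = 0; i < MAX_QPS; i++) {
        if (t->slots[i].live && t->slots[i].qp_num == qp_num) {
            return EV_HANDLED;
        }
    }
    return EV_STALE_QP;
}
```

With a zero-initialized table, adding a QP and then delivering an event for it yields `EV_HANDLED`. Delivering the same event after `qp_table_remove` yields `EV_STALE_QP`, which is the ordering I suspect we hit in MPI_Finalize.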
"All the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value." -- Carl Sagan
On Dec 30, 2010, at 20:43, Eugene Loh <eugene.loh_at_[hidden]> wrote:
> I was running a bunch of np=4 test programs over two nodes. Occasionally, *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize(). I traced the code and ran another program that mimicked the particular MPI calls made by that program. This other program, too, would occasionally trigger this error. I never saw the problem with other tests. The rate of incidence ranged from consecutive runs (I saw this once) to 1 in 100s (more typically) to even less frequently -- I've had 1000s of consecutive runs with no problems. (The tests run a few seconds apiece.) The traffic pattern is sends from non-zero ranks to rank 0, with root-0 gathers, and lots of Allgathers. The largest messages are 1000 bytes. It appears the problem is always seen on rank 3.
> Now, I wouldn't mind someone telling me, based on that little information, what the problem is here, but I guess I don't expect that. What I am asking is what IBV_EVENT_QP_ACCESS_ERR means. Again, it's seen during MPI_Finalize. The async thread is seeing this. What is this error trying to tell me?