I'm seeing performance issues I don't understand in my multithreaded
MPI code, and I was hoping someone could shed some light on this.

The code structure is as follows: A computational domain is decomposed
into MPI tasks. Each MPI task has a "master thread" that receives
messages from the other tasks and puts them into a local, concurrent
queue. Each task also has a few "worker threads" that process the
incoming messages and, when necessary, send them to other tasks. So for
each task, there is one thread doing receives and N (typically number
of cores-1) threads doing sends. All messages are nonblocking, so the
workers just post the sends and continue with computation, and the
master repeatedly does a number of test calls to check for incoming
messages (there are different flavors of these messages so it does

Currently I'm just testing, so I'm running 2 tasks using the sm btl on
one node, and 5 worker threads. (The node has 12 cores.) What happens is
that task 0 receives everything that is sent by task 1 (number of
sends and receives roughly match). However, task 1 only receives about
25% of the messages sent by task 0. Task 0 apparently has no problem
keeping up with receiving the messages from task 1, even though the
throughput in that direction is actually a bit higher. In less than a
minute, there are hundreds of thousands of pending messages (but only
in one direction). At this point, throughput drops by orders of
magnitude to <1000 msg/s. Using PAPI, I can see that the receiving
threads are at that point basically stalled on MPI tests and receives,
and stopping them in the debugger seems to indicate that they are
trying to acquire a lock. However, the test/receive that they are
stalling on is NOT the one for the huge number of pending messages,
but one for another class of much rarer messages.

I realize it's hard to know without looking at the code (it's
difficult to whittle it down to a workable example), but does anyone
have any ideas what is happening and how it can be fixed? I don't
know if there are any problems with the basic structure of the code.
For example, are the simultaneous sends/receives in different threads
bound to cause lock contention on the MPI side? How does the MPI
library decide which thread is used for actual message processing?
Does every nonblocking MPI call just "steal" a time slice to work on
communications or does MPI have its own thread dedicated to message
processing? What I would like is that the master thread devote all its
time to communication, while the sends by the worker threads should
just return as fast as possible. Would it be better for the thread
doing receives to do one large wait instead of repeatedly testing
different sets of requests, or would that acquire some lock and then
block the threads trying to post a send?

I've looked around for info on how to best structure multithreaded MPI
code, but haven't had much luck in finding anything.

This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell
PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, using Intel
12.3.174. I've attached the ompi_info output.