On Apr 14, 2009, at 12:02 PM, Shaun Jackman wrote:
> Hi Eugene,
> Eugene Loh wrote:
>> At 2500 bytes, all messages will presumably be sent "eagerly" --
>> without waiting for the receiver to indicate that it's ready to
>> receive that particular message. This would suggest congestion, if
>> any, is on the receiver side. Some kind of congestion could, I
>> suppose, still occur and back up on the sender side.
> Can anyone chime in as to what the message size limit is for an
> `eager' transmission?
>> On the other hand, I assume the memory imbalance we're talking
>> about is rather severe. Much more than 2500 bytes to be
>> noticeable, I would think. Is that really the situation you're
> The memory imbalance is drastic. I'm expecting 2 GB of memory use
> per process. The behaving processes (13/16) use the expected amount
> of memory; the remainder (3/16) misbehaving processes use more than
> twice as much memory. The specifics vary from run to run of course.
> So, yes, there are gigs of unexpected memory use to track down.
>> There are tracing tools to look at this sort of thing. The only
>> one I have much familiarity with is Sun Studio / Sun HPC
>> ClusterTools. Free download, available on Solaris or Linux, SPARC
>> or x64, plays with OMPI. You can see a timeline with message lines
>> on it to give you an idea if messages are being received/completed
>> long after they were sent. Another interesting view is
>> constructing a plot vs time of how many messages are in-flight at
>> any moment (including as a function of receiver). Lots of similar
>> tools out there... VampirTrace (tracing side only, need to analyze
>> the data), Jumpshot, etc. Again, though, there's a question in my
>> mind if you're really backing up 1000s or more of messages. (I'm
>> assuming the memory imbalances are at least Mbytes.)
> I'll check out Sun HPC ClusterTools. Thanks for the tip.
> Assuming the problem is congestion and that messages are backing up,
> is there an accepted method of dealing with this situation? It seems
> to me the general approach would be
>   if (number of outstanding messages > high water mark)
>     wait until (number of outstanding messages < low water mark)
> where I suppose the `number of outstanding messages' is defined as
> the number of messages that have been sent and not yet received by
> the other side. Is there a way to get this number from MPI without
> having to code it at the application level?
It isn't quite that simple. The problem is that these are typically
"unexpected" messages - i.e., some processes are running faster than
this one, so this one keeps falling behind, which means it has to
"stockpile" messages for later processing.
It is impossible to predict who is going to send the next unexpected
message, so attempting to say "wait" means sending a broadcast to all
procs - a very expensive operation, especially since it can be any
number of procs that feel overloaded.
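To make the stockpiling concrete, here is a toy model in plain Python (no MPI involved). It assumes the senders together deliver a fixed number of eager messages per time step and the slow process matches a smaller fixed number; the function name, the rates, and the 2500-byte size taken from the original question are all illustrative assumptions, not measurements from a real run.

```python
from collections import deque

def simulate_backlog(steps, send_per_step, recv_per_step, msg_bytes=2500):
    """Toy model of one slow receiver: each step, send_per_step eager
    messages arrive and at most recv_per_step of them get matched."""
    unexpected = deque()  # messages that arrived before a matching receive
    peak = 0
    for step in range(steps):
        for i in range(send_per_step):
            unexpected.append((step, i))
        for _ in range(min(recv_per_step, len(unexpected))):
            unexpected.popleft()
        peak = max(peak, len(unexpected))
    # Peak queue length, and the payload bytes held at that peak.
    return peak, peak * msg_bytes

peak_msgs, peak_bytes = simulate_backlog(steps=1000, send_per_step=3,
                                         recv_per_step=2)
print(peak_msgs, peak_bytes)
```

Even a small, steady rate mismatch makes the queue grow linearly with run time in this model, which is why small messages can still add up to a large memory footprint on the processes that fall behind.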
We had the same problem when working with collectives, where memory
was being overwhelmed by stockpiled messages. The solution (available
in the 1.3 series) in that case was to use the "sync" collective
system. This monitors the number of times a collective is being
executed that can cause this type of problem, and then inserts an
MPI_Barrier to allow time for the processes to "drain" all pending
messages. You can control how frequently this happens, and whether the
barrier occurs before or after the specified number of operations.
If you are using collectives, or can reframe the algorithm so you do,
you might give that a try - it has solved similar problems here. If
it helps, then you should "tune" it by increasing the provided number
(thus decreasing the frequency of the inserted barrier) until you find
a value that works for you - this will minimize performance impact on
your job caused by the inserted barriers.
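The tuning trade-off can be seen in a small model, again in plain Python rather than real MPI. It assumes each operation leaves the slow process one message further behind and that a barrier lets it drain everything pending; `run_with_sync` and its parameters are hypothetical names for this sketch and do not correspond to anything in the library.

```python
def run_with_sync(total_ops, barrier_every, send_per_op=3, recv_per_op=2):
    """Toy model: the slow rank falls behind a little on every operation,
    and a barrier every barrier_every operations lets it catch up."""
    pending = peak = barriers = 0
    for op in range(1, total_ops + 1):
        pending += send_per_op
        pending -= min(recv_per_op, pending)
        peak = max(peak, pending)
        if op % barrier_every == 0:
            pending = 0   # barrier: fast ranks wait while the slow rank drains
            barriers += 1
    # Worst backlog seen, and how many barriers were inserted.
    return peak, barriers

for n in (10, 100, 1000):
    print(n, run_with_sync(10_000, n))
```

In this model the worst-case backlog grows in proportion to the interval while the number of barriers shrinks, so the largest interval whose backlog still fits in memory is the one to aim for.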
If you are not using collectives and/or cannot do so, then perhaps we
need to consider a similar approach for simple send/recv operations.
It would probably have to be done inside the MPI library, but may be
hard to implement. The collective works because we know everyone has
to be in it. That isn't true for send/recv, so the barrier approach
won't work there - we would need some other method of stopping procs
to allow things to catch up.
Not sure what that would be offhand....but perhaps some other wiser
head will think of something!
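In the meantime, the water-mark idea from the question can be approximated at the application level if each sender throttles itself per destination, which avoids the broadcast. The sketch below is plain Python with no MPI calls; the class, the acknowledgement callback, and the rates are hypothetical. In a real MPI code one way to learn that a message has been matched, without inventing an ack protocol, would be synchronous-mode nonblocking sends (MPI_Issend), whose requests complete only after the matching receive has started, at the cost of giving up some of the eager path's latency benefit.

```python
class WatermarkSender:
    """Tracks messages sent but not yet acknowledged by one destination."""

    def __init__(self, high, low):
        assert low < high
        self.high, self.low = high, low
        self.outstanding = 0
        self.paused = False

    def may_send(self):
        # Hysteresis: stop above the high mark, resume only below the low mark.
        if self.paused and self.outstanding < self.low:
            self.paused = False
        elif not self.paused and self.outstanding > self.high:
            self.paused = True
        return not self.paused

    def on_send(self):
        self.outstanding += 1

    def on_ack(self, n=1):
        self.outstanding -= n


def run_throttled(steps, send_per_step, recv_per_step, high, low):
    """Drive one sender against one slow receiver; return (peak, sent)."""
    sender = WatermarkSender(high, low)
    peak = sent = 0
    for _ in range(steps):
        for _ in range(send_per_step):
            if not sender.may_send():
                break  # a real code would do other work or wait here
            sender.on_send()
            sent += 1
        peak = max(peak, sender.outstanding)
        # The receiver matches (and acknowledges) a few messages per step.
        sender.on_ack(min(recv_per_step, sender.outstanding))
    return peak, sent


print(run_throttled(1000, 3, 2, high=100, low=50))
```

Because the check is "greater than the high mark", the count can overshoot the mark by one message before the sender pauses. The price of the bound is that the fast sender spends part of its time idle, so the marks would need tuning much like the barrier interval above.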