These all look like fine suggestions.
Another tool you should consider using for this problem or others like
it in the future is TotalView. It seems like there
are two related questions in your current troubleshooting scenario:
1. Is the memory being used where you think it is?
2. Is there really an imbalance between sends and receives that is
clogging the unexpected queue?
I'd fire up the application under TotalView with memory debugging
enabled (one of the checkboxes
that will be right there when you start the debugger).
Once you have run to the point where you are seeing the memory
imbalance (you don't have to wait for it to get "bad"; it can just be
"noticeable"), stop all the processes by clicking stop.
Then open the memory debugging window from the "debug" menu item.
Then check the "memory statistics" view to make sure that you know
which MPI process
it is that is using more memory than the others.
Is the difference in the "heap memory"? I'm guessing it will be, but I
suppose there is always the possibility I'm
wrong so it is good to check. The memory statistics view should show
different kinds of memory.
Then select the process that is using more memory (we can call it the
process of interest) and run a "heap status" report.
This should tell you "where" your memory usage is coming from in your
program. You should get stack backtraces for all
the allocations. Depending on the magnitude of the memory usage it may
"pop right out" in the numbers or you might have to
dig a bit. I'm not sure exactly what the backtrace of the kind of
memory allocation you are talking about would look like.
One great way to pick up on more "subtle" allocations is to compare
the memory usage of a process that is behaving correctly
and the process that is behaving incorrectly.
You can do that by selecting two processes and doing a "memory
comparison". That will basically filter out all the allocations that
are "the same" (in terms of backtrace) between the two processes.
If you have several hundred extra allocations from the OpenMPI
runtime on the one process, they should be easier to find in the
difference view. If the two processes have other differences you'll
get a longer list, but if you know your code you'll hopefully be able
to quickly eliminate the ones that are "expected differences".
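To make the idea of a backtrace-based comparison concrete, here is a small conceptual sketch in Python. It is not TotalView's implementation or API; it only illustrates grouping allocations by backtrace and keeping the groups that differ between two processes. The allocation records, the function names in the backtraces (including `runtime_buffer_alloc`), and the sizes are all made up for illustration.

```python
# Conceptual sketch only: allocations are modelled as (backtrace, size) pairs,
# where a backtrace is a tuple of function names, outermost frame first.

def summarize(allocations):
    """Group allocations by backtrace: {backtrace: (count, total_bytes)}."""
    summary = {}
    for backtrace, size in allocations:
        count, total = summary.get(backtrace, (0, 0))
        summary[backtrace] = (count + 1, total + size)
    return summary

def compare(good, bad):
    """Return only the backtraces whose count or byte total differs.

    The value is (extra_count, extra_bytes) in `bad` relative to `good`.
    """
    good_summary, bad_summary = summarize(good), summarize(bad)
    diff = {}
    for backtrace in set(good_summary) | set(bad_summary):
        g = good_summary.get(backtrace, (0, 0))
        b = bad_summary.get(backtrace, (0, 0))
        if g != b:
            diff[backtrace] = (b[0] - g[0], b[1] - g[1])
    return diff

# Hypothetical data: both ranks make the same application allocations, but the
# misbehaving rank also holds 300 buffers from an invented runtime code path.
app = (("main", "load_data", "malloc"), 1024)
runtime = (("main", "MPI_Recv", "runtime_buffer_alloc", "malloc"), 2500)
good_rank = [app] * 10
bad_rank = [app] * 10 + [runtime] * 300

for backtrace, (extra_count, extra_bytes) in compare(good_rank, bad_rank).items():
    print(" -> ".join(backtrace), extra_count, extra_bytes)
```

The allocations common to both ranks drop out, so only the extra group remains to be inspected.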
It sounds like you have a strong working hypothesis. However, it might
be useful to run a memory leak check on the process of interest,
as that is another common way to get a process that starts taking up a
lot of extra memory. If your working hypothesis is correct your
process of interest should come back "clean" in terms of leaks.
Another technique that TotalView will give you the ability to bring to
bear is inspection of the MPI message queues. This can be done, again,
while the processes are stopped once the memory imbalance is
"noticeable". Click on the tools menu and select "message queue graph".
That should bring up a graphical display of the state of the MPI
message queues in all of your MPI processes. If your hypothesis is
correct there should be an extremely large number of unexpected
messages shown for your process of interest.
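For a rough sense of scale, here is a back-of-envelope sketch in Python using the numbers quoted in the thread below (25-byte operations, 100 operations per MPI_Send, about a billion operations in total). It assumes roughly 2 GB of unexpected memory on a misbehaving process, that all of it is message payload, and that there is no per-message overhead in the MPI runtime. Those are simplifying assumptions, so treat the result as an order-of-magnitude estimate only.

```python
# Figures taken from the quoted thread; all are approximate.
OP_BYTES = 25            # bytes per operation
OPS_PER_SEND = 100       # operations packed into each MPI_Send
TOTAL_OPS = 10**9        # "some billion" operations in total

# Assumption: about 2 GB of unexpected memory, all of it queued payload.
EXTRA_BYTES = 2 * 10**9

message_bytes = OP_BYTES * OPS_PER_SEND            # payload per message
total_messages = TOTAL_OPS // OPS_PER_SEND         # messages in the whole run
backlog_messages = EXTRA_BYTES // message_bytes    # messages needed to fill 2 GB
backlog_fraction = backlog_messages / total_messages

print(f"payload per message: {message_bytes} bytes")
print(f"messages in the run: {total_messages}")
print(f"backlog to explain the extra memory: {backlog_messages}")
print(f"fraction of all messages: {backlog_fraction:.0%}")
```

Under those assumptions a backlog of about 800,000 messages, roughly 8% of all sends, would account for the extra memory. Any real per-message overhead would lower the count needed.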
One of the nice things about this view, when compared to the MPI
tracing tools mentioned previously, is that it will only show you the
messages which are in the queues at the point in time where you paused
all the MPI tasks. That may be a lot of messages, but it is likely to
be many orders of magnitude lower than the number of MPI messages
displayed on the trace.
TV is commercial but a 15 day evaluation license can be obtained here
5 minute Videos on Memory debugging and MPI debugging (that go over
some, but probably not all of the things that I discussed above) are
Don't hesitate to contact me if you want help; the guys at support_at_[hidden]
can also help and are available during a product evaluation.
Oh, and I should mention that there is a free version of TotalView
available for students. :)
Chris Gottbrath, 508-652-7735 or 774-270-3155
Director of Product Management, TotalView Technologies
Learn how to radically simplify your debugging:
On Apr 14, 2009, at 4:54 PM, Eugene Loh wrote:
> Shaun Jackman wrote:
>> Eugene Loh wrote:
>>>>> On the other hand, I assume the memory imbalance we're talking
>>>>> about is rather severe. Much more than 2500 bytes to be
>>>>> noticeable, I would think. Is that really the situation you're
>>>> The memory imbalance is drastic. I'm expecting 2 GB of memory use
>>>> per process. The behaving processes (13/16) use the expected
>>>> amount of memory; the remaining (3/16) misbehaving processes use
>>>> more than twice as much memory. The specifics vary from run to
>>>> run of course. So, yes, there are gigs of unexpected memory use to
>>>> track down.
>>> Umm, how big of a message imbalance do you think you might have?
>>> (The inflection in my voice doesn't come out well in e-mail.)
>>> Anyhow, that sounds like, um, "lots" of 2500-byte messages.
>> The message imbalance could be very large. Each process is running
>> pretty close to its memory capacity. If a backlog of messages
>> causes a buffer to grow to the point where the process starts
>> swapping, it will very quickly fall very far behind. There are some
>> billion 25-byte operations being sent in total, or tens of millions
>> of MPI_Send messages (at 100 operations per MPI_Send).
> Okay. Attached is a "little" note I wrote up illustrating memory
> profiling with Sun tools. (It's "big" because I ended up including
> a few screenshots.) The program has a bunch of one-way message
> traffic and some user-code memory allocation. I then rerun with the
> receiver sleeping before jumping into action. The messages back up
> and OMPI ends up allocating a bunch of memory. The tools show you
> who (user or OMPI) is allocating how much memory and how big of a
> message backlog develops and how the sender starts stalling out
> (which is a good thing!). Anyhow, a useful exercise for me and
> hopefully helpful for you.
This transmission contains confidential and/or legally privileged
information from TotalView Technologies intended only for the use of
the individual(s) to which it is addressed. If you are not the
intended recipient, you are hereby notified that any disclosure,
copying or distribution of this information or the taking of any
action in reliance on the contents of this transmission is strictly
prohibited. If you have received this transmission in error, please
notify us immediately.