Scaling tests over the last few months have all shown a behavior that has
drawn significant comment: the HNP grows to multiple gigabytes in size for
runs involving several thousand processes.
This represents a peak size that declines to a much smaller footprint once
the application has been launched.
Given the degree of concern expressed over this behavior, I thought I would
once again provide the explanation for it. I believe I have sent emails out
about this before, but I know it can be difficult for people who don't work
regularly with OpenRTE to remember those notes after time has passed.
The observed memory spike is caused by the way we handle the STG1 stage gate
message sent to all processes. There are two contributing factors that
specifically control this behavior on the current Open MPI releases and
1. we send stage gate messages directly to each process. Thus, for N
processes, there are N messages queued for transmission at the HNP; and
2. we use non-blocking RML/OOB send commands to do the communication. Note
that we used to use blocking sends, but converted to non-blocking sends late
last year for speed. This point is critical to understanding the behavior,
as you'll see in a moment.
The key to the memory spike lies in knowing that the RML/OOB actually
*copies* the buffer given to it for transmission, then inserts the comm
request into its queue for transmission as network access permits. When we
used blocking sends, we only had *one* message in the queue at any time -
hence, the memory footprint of the HNP remained small. However, now that we
use non-blocking sends, all N messages are in the queue at once. Thus,
there are now *N* copies of the message buffer being made inside the HNP!
As transmission of each message is completed, the corresponding copy of the
data is released. Hence, the HNP's footprint gradually reduces as the
communication is completed. Once the STG1 stage gate is passed, the
footprint is back to a relatively small number.
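
To put rough numbers on this, here is a minimal sketch of the queue
accounting described above. It is a toy model, not RML/OOB code: the
function name, the 4000-process count, and the 512 KiB message size are all
assumptions picked for illustration.

```python
def peak_queued_bytes(num_procs, msg_bytes, blocking):
    """Peak bytes held in a send queue that copies every buffer it is given.

    blocking=True  : each send finishes before the next one is issued.
    blocking=False : every send is issued first, then the queue drains.
    """
    queued = 0
    peak = 0
    for _ in range(num_procs):
        queued += msg_bytes          # the send layer copies the caller's buffer
        peak = max(peak, queued)
        if blocking:
            queued -= msg_bytes      # send completed, copy released right away
    # Non-blocking case: copies are released one by one as each send completes.
    while queued > 0:
        queued -= msg_bytes
    return peak


# Hypothetical run: 4000 processes, 512 KiB stage gate message.
N = 4000
MSG = 512 * 1024
print(peak_queued_bytes(N, MSG, blocking=True))                  # one copy
print(peak_queued_bytes(N, MSG, blocking=False) / 2**30, "GiB")  # N copies
```

Under those assumed numbers the blocking case peaks at a single 512 KiB
copy, while the non-blocking case peaks at about 1.95 GiB, which is the
kind of spike seen in the scaling tests.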
One could question why the copy is being done at all. Well, when the
original author of the RML/OOB wrote that code, he was concerned that
callers might not retain the provided message buffer until *after* the
communication had been completed. This was of particular concern for
non-blocking sends since the send call immediately returns, but the message
may not actually be sent for some unknowable time into the future.
In addition, there are numerous places in the code where someone will create
a single message buffer and then send it to multiple recipients using
non-blocking sends. This buffer is then released once the send commands have
been *issued* - but that doesn't mean that the messages have actually been
sent! Of course, we could require that the buffer be retained until the
communication is complete, but that would add complexity to the code in the
caller's routine - and we opted to avoid that necessity.
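
The hazard the copy guards against can be sketched as follows. This is a
toy stand-in, not the RML/OOB interface: ToySendQueue, send_nb, and drain
are hypothetical names, and since Python cannot free memory out from under
a reference, "releasing" the buffer is simulated by overwriting it.

```python
class ToySendQueue:
    """Hypothetical non-blocking send layer (not the real RML/OOB API)."""

    def __init__(self, copy_on_send):
        self.copy_on_send = copy_on_send
        self.pending = []

    def send_nb(self, peer, buf):
        # Returns immediately; the data actually goes out later, in drain().
        data = bytes(buf) if self.copy_on_send else buf
        self.pending.append((peer, data))

    def drain(self):
        # Transmission happens here, some unknowable time after send_nb().
        delivered = {peer: bytes(data) for peer, data in self.pending}
        self.pending.clear()
        return delivered


def issue_then_release(queue):
    """One buffer, several non-blocking sends, buffer released right away."""
    buf = bytearray(b"STG1")
    for peer in range(3):
        queue.send_nb(peer, buf)
    buf[:] = b"\0\0\0\0"   # simulated release: sends issued, nothing sent yet
    return queue.drain()


print(issue_then_release(ToySendQueue(copy_on_send=True)))   # payload intact
print(issue_then_release(ToySendQueue(copy_on_send=False)))  # payload clobbered
```

With the copy, every recipient gets the original payload. Without it, every
recipient gets whatever the caller left behind in the buffer.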
Of course, we can revisit these design decisions in light of how we are
currently using the system. Perhaps we *should* require the caller to
maintain the buffer throughout the communication, and force the caller to
deal with the associated code complexity.
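
As one illustration of what that added complexity might look like, here is
a sketch of a caller-owned buffer that is shared by all N sends and released
only when the last completion callback fires. This is not how the code
works today; SharedBuffer and send_complete are hypothetical names, and the
counting scheme is just one possible approach.

```python
class SharedBuffer:
    """Hypothetical caller-owned buffer shared by several non-blocking sends."""

    def __init__(self, data, num_sends, on_release):
        self.data = data
        self.remaining = num_sends    # sends that have not completed yet
        self.on_release = on_release

    def send_complete(self):
        # The caller would register this as the callback for each send.
        self.remaining -= 1
        if self.remaining == 0:
            self.data = None          # last send done: now safe to release
            self.on_release()


released = []
buf = SharedBuffer(b"STG1", num_sends=3, on_release=lambda: released.append(True))
for _ in range(3):
    still_held = buf.data is not None   # buffer must survive until the end
    buf.send_complete()
print(released)
```

Only one copy of the payload exists no matter how many sends are queued,
but the caller now has to track completions instead of releasing the buffer
as soon as the sends are issued.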
Note that the obvious solution of just creating a new buffer for each send
and then releasing it in the corresponding callback function would solve
nothing - we would just be moving the copy function from the RML/OOB to the
caller's function. I have seen this done in a few places in the code, but
all that did was cause us to generate *two* copies of each message. So we
would have to rely on the caller to be clever about buffer management to
make any such change work.
Anyway, that is why the HNP is behaving as observed. Please note that this
will automatically improve once we turn "on" the more advanced xcast modes
as the number of messages being queued at the HNP will dramatically decline.
It won't change how the RML/OOB work, but it will reduce the footprint
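
To see why fewer queued messages helps, consider one possible relay scheme.
This note does not describe the xcast modes themselves, so the following is
purely an assumption: suppose the HNP sends one message per node and a local
relay passes it on to the processes there. The function name and the 8
processes per node are hypothetical.

```python
import math


def messages_queued_at_root(num_procs, procs_per_node=None):
    """Messages the root must queue for one broadcast.

    procs_per_node=None : direct mode, one message per process.
    procs_per_node=k    : assumed relay mode, one message per node of k procs.
    """
    if procs_per_node is None:
        return num_procs
    return math.ceil(num_procs / procs_per_node)


print(messages_queued_at_root(4000))                     # direct sends
print(messages_queued_at_root(4000, procs_per_node=8))   # assumed relay mode
```

Each queued message still gets copied, but under this assumption the root
holds 500 copies at its peak instead of 4000.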
I hope that helps clarify and, perhaps, generate some useful thoughts on