> What I do is start non-blocking MPI communication on the partition
> boundary faces (faces shared between any two processors), and then
> start computing values on the internal/non-shared faces. When this
> computation completes, I call WAITALL to ensure the MPI communication
> has completed. Then I do the computation on the partition boundary
> faces (the shared ones). This way I try to hide the communication
> behind computation. Is this correct?
As long as your numerical method allows you to do this (that is, you
definitely don't need those boundary values to compute the internal
values), then yes, this approach can hide some of the communication
costs very effectively. If I were programming this from scratch, I'd
build it up in stages: first do the usual blocking approach (no one
computes anything until all the faces are exchanged) and get that
working; then break up the computation step into internal and boundary
computations and make sure it still works; then change the messaging to
MPI_Isend/MPI_Irecv/MPI_Waitall and make sure it still works; and only
then interleave the two.
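The "split the computation and make sure it still works" stage can be
checked without any MPI at all. What follows is only a minimal sketch
under assumptions of my own: a 1D array with one ghost cell at each end
stands in for your mesh, a 3-point average stands in for your face
computation, and the names update_all, update_internal and
update_boundary are hypothetical. The MPI calls appear only in comments,
to mark where they would go in the final interleaved version.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: u[0] and u[n+1] are ghost cells holding the
 * neighbours' boundary values; u[1..n] are owned by this process. */

/* Update owned cells lo..hi (inclusive) with a 3-point average. */
static void update_range(const double *u, double *out, size_t lo, size_t hi)
{
    for (size_t i = lo; i <= hi; ++i)
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
}

/* Stage 1: one pass over everything, valid only once the ghosts are filled
 * (i.e. after a blocking exchange). */
void update_all(const double *u, double *out, size_t n)
{
    assert(n >= 2);
    update_range(u, out, 1, n);
}

/* Stage 2a: cells 2..n-1 never read a ghost cell, so they can be
 * computed while the exchange is still in flight. */
void update_internal(const double *u, double *out, size_t n)
{
    assert(n >= 2);
    if (n >= 3)
        update_range(u, out, 2, n - 1);
}

/* Stage 2b: cells 1 and n read u[0] and u[n+1], so they must wait
 * until the ghost values have arrived. */
void update_boundary(const double *u, double *out, size_t n)
{
    assert(n >= 2);
    update_range(u, out, 1, 1);
    update_range(u, out, n, n);
}

/* Final interleaved order (MPI calls shown as comments only):
 *   MPI_Irecv / MPI_Isend for the ghost cells
 *   update_internal(u, out, n);
 *   MPI_Waitall(...);
 *   update_boundary(u, out, n);
 */
```

Running update_internal followed by update_boundary should give exactly
the same output array as update_all on the same input. If it doesn't,
the internal step is reading boundary data somewhere, and overlapping
it with communication would not be safe.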
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> elements) with another processor B, then it sends/recvs 50 different
> messages. So in general, if a processor shares X faces with any number
> of other processors, it sends/recvs that many messages. Does this
> approach have "very much reduced" performance compared with processor A
> sending/receiving a single bundled message (containing the data for all
> 50 faces) to/from processor B? In that case a processor would send/recv
> only as many messages as it has neighbouring processors: one
> bundle/pack of face data per neighbouring processor.
> Is there "quite a big difference" between these two approaches?
Your individual element faces that are being communicated are likely
quite small. It is quite generally the case that bundling many small
messages into large messages can significantly improve performance, as
you avoid incurring the repeated latency costs of sending many messages.
As always, though, the answer is "it depends", and the only way to know
is to try it both ways. If you really do hide most of the
communications cost with your non-blocking communications, then it may
not matter too much. In addition, if you don't know beforehand how much
data you need to send/receive, then you'll need a handshaking step which
introduces more synchronization and may actually hurt performance, or
you'll have to use MPI-2 one-sided communication. On the other hand,
if this shared boundary doesn't change through the simulation, you could
just figure out at start-up time how big the messages will be between
neighbours and use that as the basis for the usual two-sided messages.
My experience is that there's an excellent chance you'll improve the
performance by packing the little messages into fewer larger messages.
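To make the packing idea concrete, here is a rough sketch of gathering
all the faces shared with one neighbour into a single contiguous
buffer. It is plain C with no MPI calls, and everything in it is an
assumption for illustration: the neighbour_t struct, the pack_faces and
unpack_faces names, and the idea that each face carries a single double.
It also assumes the shared-face list is fixed at start-up and that both
sides agree on the order of the faces in the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the faces shared with one neighbour:
 * face_ids[k] indexes into the local per-face data array. */
typedef struct {
    int           rank;      /* neighbouring process */
    size_t        nfaces;    /* number of shared faces, known at start-up */
    const size_t *face_ids;  /* local indices of those faces */
} neighbour_t;

/* Gather the values of all faces shared with one neighbour into a
 * contiguous buffer; returns the number of doubles packed. */
size_t pack_faces(const neighbour_t *nb, const double *face_data,
                  double *buf, size_t capacity)
{
    assert(nb->nfaces <= capacity);
    for (size_t k = 0; k < nb->nfaces; ++k)
        buf[k] = face_data[nb->face_ids[k]];
    return nb->nfaces;
}

/* Scatter a received buffer back into the local slots for those faces. */
void unpack_faces(const neighbour_t *nb, const double *buf, double *face_data)
{
    for (size_t k = 0; k < nb->nfaces; ++k)
        face_data[nb->face_ids[k]] = buf[k];
}

/* The packed buffer would then go out as one message per neighbour,
 * for example:
 *   MPI_Isend(buf, (int)n, MPI_DOUBLE, nb->rank, tag, comm, &req);
 * instead of one MPI_Isend per face.
 */
```

With 50 shared faces this turns 50 small sends into one send of 50
doubles, so the per-message latency is paid once per neighbour rather
than once per face. Whether that is a measurable win in your code is
still something to time both ways.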
Jonathan Dursi <ljdursi_at_[hidden]>