I need/request some help from those who have some experience in
debugging/profiling/tuning parallel scientific codes, specially for
I have parallelized a Fortran CFD code to run on
Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is that:
Suppose that the grid/mesh is decomposed for n number of processors, such
that each processors has a number of elements that share their side/face
with different processors. What I do is that I start non blocking MPI
communication at the partition boundary faces (faces shared between any two
processors) , and then start computing values on the internal/non-shared
faces. When I complete this computation, I put WAITALL to ensure MPI
communication completion. Then I do computation on the partition boundary
faces (shared-ones). This way I try to hide the communication behind
computation. Is it correct?
IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less elements)
with an another processor B then it sends/recvs 50 different messages. So in
general if a processors has X number of faces sharing with any number of
other processors it sends/recvs that much messages. Is this way has "very
much reduced" performance in comparison to the possibility that processor A
will send/recv a single-bundle message (containg all 50-faces-data) to
process B. Means that in general a processor will only send/recv that much
messages as the number of processors neighbour to it. It will send a single
bundle/pack of messages to each neighbouring processor.
Is their "quite a much difference" between these two approaches?
THANK YOU VERY MUCH.