On your first question, the answer is probably yes, provided everything
else is done correctly. As a first test, do not overlap communication
and computation at all: run them sequentially and make sure the answers
are correct. Have you done this test? Debugging your original approach
will be challenging, and having a control solution will be a big help.
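
One way to use such a control solution is to dump the field values from the
sequential run and from the overlapped run, then compare them within a
tolerance. The sketch below is only an illustration: the function names and
the tolerance values are my own assumptions, not anything from your code, and
the tolerance is there because a different order of floating-point operations
can change the last few digits.

```python
import math


def max_abs_diff(reference, candidate):
    """Largest absolute difference between two equally long result arrays."""
    if len(reference) != len(candidate):
        raise ValueError("result arrays differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def matches_control(reference, candidate, rel_tol=1e-12, abs_tol=1e-14):
    """True if every candidate value agrees with the control within tolerance.

    rel_tol and abs_tol are illustrative defaults (assumptions); choose
    values that suit the scaling of your own variables.
    """
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )
```

If matches_control fails, max_abs_diff gives a quick idea of whether you are
looking at round-off or at a real bug such as reading a halo value before the
wait has completed.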
On your second question, if I understand it correctly, the answer is
that it is always better to minimize the number of messages. In problems
like this, communication costs are dominated by latency, so bundling the
data into the fewest possible messages will ALWAYS be better.
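
To put rough numbers on the latency argument, here is a back-of-the-envelope
sketch using the usual simple cost model (time = messages * latency +
bytes / bandwidth). Every number in it is an assumption chosen for
illustration (50 microseconds latency, 1 Gbit/s bandwidth, 5 double-precision
values per face), not a measurement of your cluster.

```python
def transfer_time(n_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple cost model: each message pays the latency once, and the
    payload moves at the given bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# All values below are assumptions for illustration only.
LATENCY_S = 50e-6        # assumed per-message latency on Ethernet
BANDWIDTH = 125e6        # assumed 1 Gbit/s link, in bytes per second
FACES = 50               # faces shared between processors A and B
BYTES_PER_FACE = 5 * 8   # assumed 5 double-precision values per face

total_bytes = FACES * BYTES_PER_FACE

# One message per face versus one bundled message per neighbour.
per_face = transfer_time(FACES, total_bytes, LATENCY_S, BANDWIDTH)
bundled = transfer_time(1, total_bytes, LATENCY_S, BANDWIDTH)

print(f"one message per face: {per_face * 1e6:.1f} microseconds")
print(f"one bundled message:  {bundled * 1e6:.1f} microseconds")
print(f"ratio:                {per_face / bundled:.1f}x")
```

With payloads this small the bandwidth term is almost negligible, so the cost
is close to proportional to the number of messages. Substitute the latency
and bandwidth you actually measure on your network to see how large the gap
is for your case.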
On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> Hi all,
> I need/request some help from those who have some experience in
> debugging/profiling/tuning parallel scientific codes, specially for
> I have parallelized a Fortran CFD code to run on
> Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is
> Suppose that the grid/mesh is decomposed for n processors,
> such that each processor has a number of elements that share their
> side/face with different processors. What I do is start non-blocking
> MPI communication at the partition boundary faces (faces
> shared between any two processors), and then start computing values
> on the internal/non-shared faces. When I complete this computation, I
> call WAITALL to ensure MPI communication has completed. Then I do the
> computation on the partition boundary faces (shared ones). This way I
> try to hide the communication behind computation. Is this correct?
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> elements) with another processor B, then it sends/recvs 50 different
> messages. So in general, if a processor shares X faces with any
> number of other processors, it sends/recvs that many messages.
> Does this approach have "very much reduced" performance in comparison
> to the possibility that processor A sends/recvs a single bundled
> message (containing the data for all 50 faces) to processor B? That
> is, in general a processor would send/recv only as many messages as
> it has neighbouring processors: a single bundle/pack of data to each
> neighbouring processor.
> Is there "quite a big difference" between these two approaches?
> THANK YOU VERY MUCH.