Jonathan's answer seems almost perfect; I perceive the same.
On Fri, Nov 6, 2009 at 6:17 PM, Tom Rosmond <rosmond_at_[hidden]> wrote:
> On your first question, the answer is probably yes, if everything else is
> done correctly. The first test is to not try to do the overlapping
> communication and computation, but do them sequentially and make sure
> the answers are correct. Have you done this test? Debugging your
> original approach will be challenging, and having a control solution
> will be a big help.
I followed the path of serial first, then parallel with blocking
communication, and then parallel with non-blocking communication.
My serial solution is the control solution.
> On your second question, if I understand it correctly, the answer is that
> it is always better to minimize the number of messages. In problems like this,
> communication costs are dominated by latency, so bundling the data into
> the fewest possible messages will ALWAYS be better.
But what Jonathan pointed out:
If you really do hide most of the communications cost with your non-blocking
communications, then it may not matter too much.
is the point I want to be sure about.
> T. Rosmond
> On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> > Hi all,
> > I need/request some help from those who have some experience in
> > debugging/profiling/tuning parallel scientific codes, especially for
> > PDEs/CFD.
> > I have parallelized a Fortran CFD code to run on an
> > Ethernet-based Linux cluster. Regarding MPI communication, what I do
> > is the following:
> > Suppose the grid/mesh is decomposed over n processors, such that
> > each processor has a number of elements that share a side/face with
> > elements on other processors. I start non-blocking MPI communication
> > on the partition-boundary faces (faces shared between two
> > processors), and then start computing values on the
> > internal/non-shared faces. When this computation is complete, I call
> > WAITALL to ensure completion of the MPI communication. Then I do the
> > computation on the partition-boundary faces (the shared ones). This
> > way I try to hide the communication behind computation. Is it correct?
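The overlap pattern described above might look roughly like the following Fortran sketch. This is a minimal illustration, not the actual code: the names (`nneigh`, `neighbour`, `facecount`, `sendbuf`, `recvbuf`, `nvar`, and the two `compute_*` routines) are hypothetical placeholders.

```fortran
! Sketch of communication/computation overlap (hypothetical names).
! Assumes sendbuf(:,i)/recvbuf(:,i) hold the face data exchanged with
! the i-th neighbouring rank, already packed contiguously.
integer :: ierr, i
integer :: requests(2*nneigh)
integer :: statuses(MPI_STATUS_SIZE, 2*nneigh)

! 1. Post non-blocking receives and sends for partition-boundary faces.
do i = 1, nneigh
   call MPI_Irecv(recvbuf(:,i), facecount(i)*nvar, MPI_DOUBLE_PRECISION, &
                  neighbour(i), 100, MPI_COMM_WORLD, requests(i), ierr)
end do
do i = 1, nneigh
   call MPI_Isend(sendbuf(:,i), facecount(i)*nvar, MPI_DOUBLE_PRECISION, &
                  neighbour(i), 100, MPI_COMM_WORLD, requests(nneigh+i), ierr)
end do

! 2. Compute fluxes on interior (non-shared) faces while messages are
!    in flight.
call compute_interior_faces()

! 3. Wait for all boundary exchanges to complete.
call MPI_Waitall(2*nneigh, requests, statuses, ierr)

! 4. Ghost data is now valid: compute the shared-face fluxes.
call compute_boundary_faces()
```

One detail worth noting: the receives are posted before the sends, so matching messages from neighbours can land directly in `recvbuf` instead of being buffered by the MPI library.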
> > IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> > elements) with another processor B, then it sends/receives 50
> > different messages. So in general, if a processor has X faces shared
> > with any number of other processors, it sends/receives that many
> > messages. Does this approach have "very much reduced" performance
> > compared with the alternative, in which processor A sends/receives a
> > single bundled message (containing all the 50-faces data) to/from
> > processor B? That means, in general, a processor would only
> > send/receive as many messages as it has neighbouring processors; it
> > would send a single bundle/pack of data to each neighbouring
> > processor. Is there "quite a much difference" between these two
> > approaches?
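For the single-message-per-neighbour variant, the data for all faces shared with a given neighbour would be packed into one contiguous buffer before a single MPI_Isend. A minimal sketch, again with hypothetical names (`u` is the solution array, `sharedface(f,i)` the index of the f-th face shared with neighbour i):

```fortran
! Pack all faces shared with neighbour i into one contiguous buffer,
! then post ONE send per neighbour instead of one send per face.
! The matching receives (one per neighbour) are posted analogously.
do i = 1, nneigh
   k = 0
   do f = 1, facecount(i)            ! faces shared with neighbour i
      do v = 1, nvar                 ! solution variables per face
         k = k + 1
         sendbuf(k, i) = u(v, sharedface(f, i))
      end do
   end do
   call MPI_Isend(sendbuf(:,i), k, MPI_DOUBLE_PRECISION, neighbour(i), &
                  200, MPI_COMM_WORLD, requests(i), ierr)
end do
```

As a rough illustration of why this matters on Ethernet: if the per-message latency is on the order of 50 microseconds (a typical gigabit-Ethernet figure, used here only as an assumption), 50 separate face messages pay roughly 50 × 50 µs = 2.5 ms in latency alone, while one bundled message pays about 50 µs plus a small extra bandwidth cost for the larger payload.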
> > THANK YOU VERY MUCH.
> > AMJAD.
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users