Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Programming Help needed
From: Tom Rosmond (rosmond_at_[hidden])
Date: 2009-11-06 18:17:16


On your first question, the answer is probably yes, if everything else
is done correctly. The first test is to not overlap the communication
and computation, but to do them sequentially and make sure the answers
are correct. Have you done this test? Debugging your original approach
will be challenging, and having a control solution will be a big help.
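For reference, the overlap pattern in question looks roughly like the
Fortran sketch below. All names here (nbr, sendbuf, recvbuf, reqs,
compute_interior_faces, compute_boundary_faces) are illustrative, not
from the actual code:

```fortran
! Minimal sketch of overlapping communication with computation.
! One neighbour, one message each way; names are assumptions.
call MPI_IRECV(recvbuf, n, MPI_DOUBLE_PRECISION, nbr, tag, &
               MPI_COMM_WORLD, reqs(1), ierr)
call MPI_ISEND(sendbuf, n, MPI_DOUBLE_PRECISION, nbr, tag, &
               MPI_COMM_WORLD, reqs(2), ierr)

call compute_interior_faces()   ! needs no remote data; overlaps the transfer

call MPI_WAITALL(2, reqs, MPI_STATUSES_IGNORE, ierr)

call compute_boundary_faces()   ! halo data is now guaranteed to have arrived
```

For the control solution, simply move compute_interior_faces() to after
the MPI_WAITALL, so communication and computation run strictly
sequentially; the two versions must produce identical answers.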

On your second question, if I understand it correctly, the answer is
that it is always better to minimize the number of messages. In
problems like this, communication costs are dominated by latency, so
bundling the data into the fewest possible messages will ALWAYS be
better.
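A hedged sketch of the bundled alternative, assuming each process
gathers all face values destined for one neighbour into a single
contiguous buffer (nfaces, face_id, q, buf are all illustrative names):

```fortran
! Pack every shared-face value for neighbour nbr into one buffer, so the
! exchange costs one message per neighbour instead of one per face.
do i = 1, nfaces(nbr)
   buf(i) = q(face_id(i, nbr))          ! gather scattered face data
end do
call MPI_ISEND(buf, nfaces(nbr), MPI_DOUBLE_PRECISION, nbr, tag, &
               MPI_COMM_WORLD, req, ierr)
```

Under the usual latency/bandwidth cost model, 50 per-face messages pay
the per-message latency 50 times, while one bundled message pays it
once for the same total payload; with Ethernet latencies of tens of
microseconds and small face payloads, that latency term dominates.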

T. Rosmond

On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> Hi all,
> I need/request some help from those who have experience in
> debugging/profiling/tuning parallel scientific codes. I have
> parallelized a Fortran CFD code to run on an Ethernet-based Linux
> cluster. Regarding MPI communication, what I do is this:
> Suppose that the grid/mesh is decomposed for n processors, such that
> each processor has a number of elements that share a side/face with
> elements on other processors. I start non-blocking MPI communication
> for the partition boundary faces (faces shared between any two
> processors), and then compute values on the internal/non-shared
> faces. When that computation is complete, I call MPI_WAITALL to
> ensure the MPI communication has finished. Then I do the computation
> on the partition boundary (shared) faces. This way I try to hide the
> communication behind computation. Is this correct?
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> elements) with another processor B, then it sends/receives 50
> separate messages. So in general, if a processor has X faces shared
> with any number of other processors, it sends/receives that many
> messages. Does this approach have "very much reduced" performance
> compared to having processor A send/receive a single bundled message
> (containing all 50 faces' data) to/from processor B? In that case, a
> processor would send/receive only as many messages as it has
> neighbouring processors: one bundle/pack of data per neighbour.
> Is there much of a difference between these two approaches?
> _______________________________________________
> users mailing list
> users_at_[hidden]