Hi Cole, Jed
I don't have much direct experience with PETSc.
I mostly troubleshooted other people's PETSc programs,
and observed their performance.
What I noticed is:
1) PETSc's learning curve is as steep if not steeper than MPI, and
2) PETSc codes seem to be slower (or have more overhead)
than codes written directly in MPI.
Jed seems to have a different perception of PETSc, though,
and is more enthusiastic about it.
Admittedly, I don't have any direct comparison
(i.e. the same exact code implemented via PETSc and via MPI),
to support what I said above.
However, it is true that PETSc comes with a big toolbox,
and a large number of solvers.
PETSc is also very flexible, you can attach a variety of other
packages to it (most linear algebra stuff, sparse matrices, etc, etc).
The PETSc fans here love its ability to
prototype complex algorithms in relatively little time.
I don't know if it should be used for production code, though,
and I don't even know if your code is meant to run efficiently
in production mode.
OTOH, if you have a clean and good serial code already developed,
I think it won't be a big deal to parallelize it directly
with MPI, assuming that the core algorithm (your Gauss-Seidel solver)
fits the remaining code in a well structured way.
As per the previous discussion on this thread,
if you don't plan to use a very large number of processors,
you can parallelize in X "books"
(see the IU and LLNL diffusion equation example I sent you),
which is probably the simplest approach,
and still get a decent performance.
(For many processors / small subdomains, use XY "pencils instead.)
With X "book" subdomains you won't need to create MPI types.
You can use the very array sections (pointer and size, assuming
they sit in contiguous memory) in the MPI functions.
To read/write initial/final data you can use a master/slave scheme,
where process rank 0 (master) reads the whole thing,
then use MPI_Scatter[v] to distribute the data
to the other process ranks.
(Rank 0 can still do part of
the computational work also.)
At the end, all processes use MPI_Gather[v] to return the
results to rank 0, who writes the results to disk.
At each time step you exchange halo/ghost sections across
neighbor subdomains, using MPI_Send/MPI_Recv,
Even better if you use non-blocking calls
Read about the advantages of non-blocking communication
in the "MPI The Complete Reference, Vol 1" book that I suggested
You can do the bookkeeping of "which subdomain/process_rank is my
left neighbor?" etc, yourself, if you create domain neighbor
tables when the program initializes.
Alternatively, and more elegantly, you can use the MPI
Cartesian topology functions to take care of this for you.
The description above is probably a very simple minded approach to
However, it should work with a decent performance,
and won't take too much effort or time to develop.
I hope this helps.
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
Cole, Derek E wrote:
> I actually am only working on this code because it is what
someone wants. I would have probably chosen a
different solver as well.
We do have some vector machines though that this will probably run on.
> I did a lot of memory work already in the serial code to speed
it up pretty significantly.
I also have to a little careful of what other
libraries are introduced.
I am reading up on PETSc to see how flexible it is.
> Thanks for the responses
> -----Original Message-----
> From: Jed Brown [mailto:five9a2_at_[hidden]] On Behalf Of Jed Brown
> Sent: Thursday, March 11, 2010 1:00 PM
> To: Cole, Derek E; users_at_[hidden]
> Subject: Re: [OMPI users] 3D domain decomposition with MPI
> On Thu, 11 Mar 2010 12:44:01 -0500, "Cole, Derek E" <derek.e.cole_at_[hidden]> wrote:
>> I am replying to this via the daily-digest message I got. Sorry it
>> wasn't sooner... I didn't realize I was getting replies until I got
>> the digest. Does anyone know how to change it so I get the emails as
>> you all send them?
> Log in at the bottom and edit options:
>> I am doing a Red-black Gauss-Seidel algorithm.
> Note that red-black Guss-Seidel is a terrible algorithm on cache-based hardware, it only makes sense on vector hardware. The reason for this is that the whole point is to approximate a dense action (the inverse of a matrix), but the red-black ordering causes this action to be purely local. A sequential ordering, on the other hand, is like a dense lower-triangular operation, which tends to be much closer to a real inverse. In parallel, you do these sequential sweeps on each process, and communicate when you're done. The memory performance will be twice as good, and the algorithm will converge in fewer iterations.
>> The ghost points are what I was trying to figure for moving this into
>> the 3rd dimension. Thanks for adding some concrete-ness to my idea of
>> exactly how much overhead is involved. The test domains I am computing
>> on are on the order of 100*50*50 or so. This is why I am trying to
>> limit the overhead of the MPI communication. I am in the process of
>> finding out exactly how big the domains may become, so that I can
>> adjust the algorithm accordingly.
> The difficulty is for small subdomains. If you have large subdomains, then parallel overhead will almost always be small.
>> If I understand what you mean by pencils versus books, I don't think
>> that will work for these types of calculations will it? Maybe a better
>> explanation of what you mean by a pencil versus a book. If I was going
>> to solve a sub-matrix of the XY plane for all Z-values, what is that
> That would be a "book" or "slab".
> I still recommend using PETSc rather than reproducing standard code to call MPI directly for this, it will handle all the decomposition and updates, and has the advantage that you'll be able to use much better algorithms than Gauss-Seidel.
> users mailing list