
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI_Allreduce()
From: Lawrence Stewart (larry.stewart_at_[hidden])
Date: 2008-03-13 10:59:51

Brock Palen wrote:

>Yeah, I know what you mean about if you have to use an 'all to all',
>use MPI_Alltoall() -- don't roll your own.
>So on paper, alltoall at first glance appears to be: n*(n-1) -> n^2 - n -> n^2 (for large n).
>Allreduce appears to be simplest: n point-to-points followed by a
>bcast(), which can be simplified to gather + bcast.
>Last I knew MPI_Bcast() was log(n) and gather is (n). So for
>allreduce I get:
>I guess I am confused how to get alltoall() down from n^2.
It's important to track both time complexity and space complexity.
The right collective algorithm is a function of message size, rank
count, and the latency-vs-bandwidth tradeoffs of the interconnect.
Probably the MPI implementation knows better than the application
programmer, but that isn't always true...

Regarding Alltoall...
It is only for "sufficiently large messages" that an n**2 implementation
of Alltoall makes sense. For example, an Alltoall can be thought of
as N simultaneous Gather operations (or N simultaneous Scatters). When
the items being sent with Alltoall are small, it makes sense to "forward"
them through other ranks on the way to their destinations. If the data
going from each rank to each other rank has size "x", then Alltoall can
be done in log N rounds where in each round each rank sends one message
of size Nx. The total data communicated with this design is
log N * N * Nx, but the total number of messages is only N log N, rather
than N**2. In the limit of large "x", of course, it is more efficient to
send the N**2 direct messages. The tipping point is when the messages
are large enough that the runtime is bandwidth dominated rather than
latency dominated.
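To make the forwarding idea concrete, here is a toy Python simulation of
a log-round alltoall using hypercube (bitwise) routing; the function
names and data layout are mine for illustration, not from any MPI
implementation. In round r, each rank exchanges one message with the
partner whose rank differs in bit r, forwarding every item whose
destination disagrees with it in that bit. After log2(N) rounds every
item has reached its destination, so the message count is N log N
instead of N**2:

```python
import math

def alltoall_log_rounds(send):
    """Simulate a log2(N)-round alltoall via hypercube routing.

    send[i][j] is the payload rank i wants delivered to rank j.
    Returns recv with recv[j][i] == send[i][j].
    (Toy sketch: assumes a power-of-two rank count.)
    """
    n = len(send)
    assert n & (n - 1) == 0, "power-of-two rank count for simplicity"
    # Each rank starts out holding (dest, src, payload) triples for its own data.
    held = [[(j, i, send[i][j]) for j in range(n)] for i in range(n)]
    for r in range(int(math.log2(n))):
        bit = 1 << r
        outgoing = [[] for _ in range(n)]
        for i in range(n):
            keep = []
            for item in held[i]:
                dest = item[0]
                if (dest ^ i) & bit:
                    # Destination disagrees with us in bit r:
                    # forward via our single partner this round.
                    outgoing[i ^ bit].append(item)
                else:
                    keep.append(item)
            held[i] = keep
        for i in range(n):
            # Each rank receives exactly one (bundled) message per round.
            held[i].extend(outgoing[i])
    recv = [[None] * n for _ in range(n)]
    for i in range(n):
        for dest, src, payload in held[i]:
            assert dest == i  # everything has arrived after log2(N) rounds
            recv[i][src] = payload
    return recv
```

Counting messages in the simulation: each of the N ranks sends one
bundled message per round, over log2(N) rounds, which matches the
N log N figure above.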

There's a nice paper by the folks at Sandia about using this technique
for HPCC Random Access, and another one by John Mellor-Crummey.

Using the same idea, Gather for small messages can be implemented like
Reduce with the operation "concatenate", and done in log N steps.
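As a sketch of that last point (again a toy simulation, with names of my
own choosing): a binomial-tree Gather to rank 0 in log2(N) steps, where
the "reduction operation" is list concatenation:

```python
def gather_log_steps(data):
    """Simulate a binomial-tree Gather to rank 0 in log2(N) steps,
    treating Gather as a Reduce whose operation is concatenation.
    (Toy sketch: assumes a power-of-two rank count.)
    """
    n = len(data)
    assert n & (n - 1) == 0, "power-of-two rank count for simplicity"
    # Tag each payload with its source rank so rank order can be restored.
    buf = [[(i, data[i])] for i in range(n)]
    active = set(range(n))
    step = 1
    while step < n:
        for i in list(active):
            if i & step:
                # This rank sends everything it has accumulated
                # to its partner and then goes idle.
                partner = i ^ step
                buf[partner].extend(buf[i])  # the "concatenate" reduction op
                active.discard(i)
        step *= 2
    # Rank 0 now holds all N items; sort by source rank.
    return [payload for _, payload in sorted(buf[0])]
```

Each step halves the number of active ranks, so rank 0 has everything
after log2(N) steps, at the cost of larger and larger messages near the
root.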

-Larry / Sector IX