Brock Palen wrote:
>Yeah, I know what you mean about if you have to use a 'all to all'
>use MPI_Alltoall() don't roll your own.
>So on paper, alltoall at first glance appears to be: n*(n-1) -> n^2-
>n -> n^2 (for large n).
>Allreduce appears to be simplest, n point to points followed by a
>bcast(). Which can be simplified to gather + bcast.
>Last I knew MPI_Bcast() was log(n) and gather is (n). So for
>allreduce I get:
>I guess I am confused how to get alltoall() down from n^2.
It's important to track both time complexity and space complexity.
Collective algorithms have to be functions of message size, rank
count, and communications latency vs bandwidth tradeoffs. Probably
the MPI implementation knows better than the application programmer, but
that isn't always true...
It is only for "sufficiently large messages" that an n**2 implementation
of Alltoall makes sense. For example, an Alltoall could be thought of
as N simultaneous Gather operations (or N scatter). When the size of
the items being sent using Alltoall is small, it makes sense to "forward" items through multiple other ranks before they get to where
they are going. If the size of items going from each rank to each
other rank is "x", then Alltoall can be done in logN rounds where in
each round each rank sends one message of size Nx. The total data
communicated with this design is logN*N*Nx, but the total number
of messages is only NlogN, rather than N**2. In the limit of large "x"
of course it is more efficient to send N**2 messages. The
tipping point is when the data is large enough that the runtime is
bandwidth dominated rather than latency dominated.
There's a nice paper by the folks at Sandia about using this technique
for HPCC Random Access, and another one by John Mellor-Crummey
Using the same idea, Gather for small messages is implemented like
Reduce with operation "concatenate" and can be done in log steps.
-Larry / Sector IX