v1.1 does not have the tuned collectives (I think, though I'm no longer
100% sure), or at least they were not active by default. The first
version with the tuned collectives will be 1.2. The current decision
function (from the nightly builds) targets high-performance networks
with two characteristics: low latency (4-5 microseconds) and high
bandwidth (over 1 Gb/s).
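To give a feel for what those two numbers mean for message sizes, here is a rough sketch using the usual latency-plus-bandwidth cost model. The model and the function names are assumptions for illustration only; this is not the decision function in Open MPI. Only the 5 microsecond and 1 Gb/s figures come from the text above.

```python
# Latency-bandwidth ("alpha-beta") cost of one point-to-point message.
# This model is an assumption used for illustration; it is NOT the
# Open MPI decision function.

def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated seconds to send nbytes: fixed latency plus size / bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def crossover_bytes(latency_s, bandwidth_bytes_per_s):
    """Message size at which the latency term equals the bandwidth term."""
    return latency_s * bandwidth_bytes_per_s


LATENCY_S = 5e-6                 # 5 microseconds, upper end of the quoted range
BANDWIDTH_BYTES_PER_S = 1e9 / 8  # 1 Gb/s expressed in bytes per second
```

Under these assumptions the two terms balance at roughly 625 bytes: messages well below that are dominated by latency, messages well above it by bandwidth, which is why a single algorithm rarely wins for all message sizes.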
There are several implementations of each algorithm. Some
are wired and some are not. The most difficult part is making sure
each of these implementations is correct (from the MPI point of view) and
gives the expected answer in all circumstances. The more functions we
have, the more tests we have to perform, and right now that's the main
limitation. We have other algorithms implemented that are not in
Open MPI right now. They will come as soon as they are tested well
enough for us to feel confident about their correctness.
Here are the answers:
1. Not all algorithms are wired to be shown by ompi_info. Everything
out of range is set to the default value which means the current
2. The Allreduce algorithms are coming soon. Btw, all algorithms
inside Open MPI support segmentation, and all of the tree-based ones
support a fanout input (number of children).
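For readers unfamiliar with the two knobs, here is a minimal sketch of what segmentation and fanout mean. The helper names and the level-by-level tree layout rooted at rank 0 are assumptions made for illustration; the trees Open MPI actually builds may be laid out differently.

```python
# Hypothetical helpers illustrating "segmentation" and "fanout".
# Not Open MPI code; the tree layout is an assumed level-order k-ary tree.

def split_into_segments(count, segment_size):
    """Return (offset, length) pairs that cover `count` elements in order."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [(off, min(segment_size, count - off))
            for off in range(0, count, segment_size)]


def tree_children(rank, size, fanout):
    """Children of `rank` in a tree where every node has up to `fanout` children."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def tree_parent(rank, fanout):
    """Parent of `rank` in the same tree, or None for the root (rank 0)."""
    return None if rank == 0 else (rank - 1) // fanout
```

With segmentation, a long message travels down the tree one segment at a time, so a node can forward segment k to its children while it is still receiving segment k+1. A larger fanout gives a shallower tree but makes each parent send to more children.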
Time is the only thing we're missing right now ... i.e. the weeks
(now without the s) before SC.
On Nov 2, 2006, at 11:00 PM, Tony Ladd wrote:
> I found the info I think you were referring to. Thanks. I then
> essentially randomly with different algorithms for all reduce. But
> the issue
> with really bad performance for certain message sizes persisted
> with v1.1.
> The good news is that the upgrade to 1.2 fixed my worst problem.
> Now the
> performance is reasonable for all message sizes. I will test the tuned
> algorithms again asap.
> I had a couple of questions
> 1) Ompi_info lists only 3 or 4 algorithms for allreduce and reduce
> and about
> 5 for b'cast. But you can use higher numbers as well. Are these
> undocumented algorithms (you mentioned a number like 15) or are they
> out-of-range parameters?
> 2) It seems for allreduce you can select a tuned reduce and tuned
> instead of the binary tree. But there is a faster allreduce which
> is order
> 2N rather than 4N for Reduce + Bcast (N is msg size). It segments
> the vector
> and distributes the root among the nodes; in an allreduce there is
> no need
> to gather the root vector to one processor and then scatter it
> again. I
> wrote a simple version for powers of 2 (MPI_SUM). Any chance of it being
> implemented in OMPI?
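
The allreduce described in the quoted message can be sketched as a reduce-scatter followed by an allgather. What follows is a small single-process simulation, assuming a power-of-two number of ranks, a vector length divisible by the number of ranks, and summation as the operation. It uses recursive halving and recursive doubling, which is one common way to do this; it is not Tony's code and not the Open MPI implementation, and the function name is made up for the example.

```python
# Single-process simulation of a segmented allreduce (sum):
# reduce-scatter by recursive halving, then allgather by recursive doubling.
# Assumptions: number of ranks is a power of two, vector length is a
# multiple of the number of ranks. Illustration only, no MPI involved.

def allreduce_sum(vectors):
    """Return (result_per_rank, elements_sent_per_rank)."""
    p = len(vectors)
    if p == 0 or p & (p - 1):
        raise ValueError("number of ranks must be a power of two")
    n = len(vectors[0])
    if n % p:
        raise ValueError("vector length must be a multiple of the rank count")

    data = [list(v) for v in vectors]
    lo, hi = [0] * p, [n] * p      # range each rank is currently responsible for
    sent = [0] * p                 # elements sent by each rank

    # Phase 1: each step halves the range a rank is responsible for.
    d = p // 2
    while d >= 1:
        before = [list(v) for v in data]
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            keep = (lo[r], mid) if r & d == 0 else (mid, hi[r])
            for i in range(*keep):
                data[r][i] = before[r][i] + before[partner][i]
            sent[r] += (hi[r] - lo[r]) - (keep[1] - keep[0])
            lo[r], hi[r] = keep
        d //= 2

    # Phase 2: each step doubles the range of fully reduced values a rank holds.
    d = 1
    while d < p:
        before = [list(v) for v in data]
        old_lo, old_hi = list(lo), list(hi)
        for r in range(p):
            partner = r ^ d
            for i in range(old_lo[partner], old_hi[partner]):
                data[r][i] = before[partner][i]
            sent[r] += old_hi[r] - old_lo[r]
            lo[r] = min(old_lo[r], old_lo[partner])
            hi[r] = max(old_hi[r], old_hi[partner])
        d *= 2

    return data, sent
```

In this sketch every rank sends N(P-1)/P elements in each phase, so a little under 2N in total for P ranks and a vector of N elements, and no rank ever holds the whole reduced vector before the allgather starts.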