I found the info I think you were referring to. Thanks. I then experimented
essentially randomly with different algorithms for allreduce, but the issue
with really bad performance for certain message sizes persisted with v1.1.
The good news is that the upgrade to v1.2 fixed my worst problem. Now the
performance is reasonable for all message sizes. I will test the tuned
algorithms again asap.
I have a couple of questions:
1) ompi_info lists only 3 or 4 algorithms for allreduce and reduce, and about
5 for bcast, but higher numbers are accepted as well. Are these additional
undocumented algorithms (you mentioned a number like 15), or are out-of-range
values simply being ignored?
2) It seems that for allreduce you can select a tuned reduce followed by a
tuned bcast instead of the binary tree. But there is a faster allreduce which
is of order 2N rather than 4N for Reduce + Bcast (N is the message size). It
segments the vector and distributes the root role among the nodes; in an
allreduce there is no need to gather the root vector on one processor and
then scatter it again. I wrote a simple version for powers of 2 (MPI_SUM
only). Is there any chance of it being implemented in OMPI?
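
To make the scheme in 2) concrete, here is a rough sketch written as a
single-process Python simulation rather than real MPI code. It is not the
version mentioned above and not anything from the OMPI source. It assumes a
power-of-two number of ranks and a vector length divisible by that number, and
it uses recursive halving for the reduce-scatter phase and recursive doubling
for the allgather phase, which is one common way to build such an allreduce.
The name allreduce_sum and the "sent" counter are made up for illustration.

```python
def allreduce_sum(vectors):
    """Simulate a reduce-scatter + allgather allreduce with a sum operation.

    vectors[r] is the input vector of simulated rank r. Returns the per-rank
    result vectors and the number of elements each rank sent.
    """
    p = len(vectors)
    n = len(vectors[0])
    # Assumptions of this sketch: power-of-two rank count, and a vector
    # length that is divisible by the rank count.
    assert p > 0 and p & (p - 1) == 0
    assert n % p == 0 and all(len(v) == n for v in vectors)

    data = [list(v) for v in vectors]
    lo = [0] * p   # start of the slice rank r currently holds
    hi = [n] * p   # end (exclusive) of that slice
    sent = [0] * p

    # Phase 1: reduce-scatter by recursive halving. At each step a rank
    # sends half of its current slice to a partner and sums the partner's
    # copy of the other half into its own data.
    dist = p // 2
    while dist >= 1:
        snapshot = [list(d) for d in data]  # values before this exchange
        new_lo, new_hi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            keep = (lo[r], mid) if r & dist == 0 else (mid, hi[r])
            for i in range(*keep):
                data[r][i] = snapshot[r][i] + snapshot[partner][i]
            sent[r] += (hi[r] - lo[r]) - (keep[1] - keep[0])
            new_lo[r], new_hi[r] = keep
        lo, hi = new_lo, new_hi
        dist //= 2

    # Now rank r holds the fully reduced block r; no single root ever held
    # the whole reduced vector.

    # Phase 2: allgather by recursive doubling. Partners swap the reduced
    # slices they hold, so the slice doubles at every step.
    dist = 1
    while dist < p:
        snapshot = [list(d) for d in data]
        new_lo, new_hi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ dist
            for i in range(lo[partner], hi[partner]):
                data[r][i] = snapshot[partner][i]
            sent[r] += hi[r] - lo[r]
            new_lo[r] = min(lo[r], lo[partner])
            new_hi[r] = max(hi[r], hi[partner])
        lo, hi = new_lo, new_hi
        dist *= 2

    return data, sent


if __name__ == "__main__":
    vecs = [[r * 10 + i for i in range(8)] for r in range(4)]
    result, sent = allreduce_sum(vecs)
    print(result[0])
    print(sent)
```

In this simulation each rank sends N(1 - 1/P) elements in each phase, so about
2N in total for N elements on P ranks, which is where the "order 2N" figure
comes from.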