Having run some more benchmarks, the new default is *really* bad for our
application (2-10x slower), so I've been looking at the source to try
and figure out why.
It seems that the biggest difference will occur when the all_to_all is
actually sparse (e.g. our application). If most N-M process exchanges
are zero in size, the 1.6.0 ompi_coll_tuned_alltoallv_intra_basic_linear
algorithm will only post irecv/isend for non-zero exchanges;
any zero-size exchanges are skipped. It then waits once for all requests
to complete. In contrast, the new
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size
exchanges for *every* N-M pair, and wait for each pairwise exchange.
This is O(comm_size) waits, many of which are for zero-size exchanges. I'm not clear what
optimizations there are for zero-size isend/irecv, but surely there's a
great deal more latency if each pairwise exchange has to be confirmed
complete before executing the next?
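
To make the comparison above concrete, here is a rough sketch in plain C (no MPI calls) that counts the work each strategy does for one rank. It is only a model of the behaviour as I described it, not the Open MPI source: the function and struct names are made up for illustration, and the pairwise peer schedule (send to rank+step, receive from rank-step) is my assumption about how the steps are paired.

```c
#include <stddef.h>

/* Hypothetical model of the linear scheme: one irecv per non-zero
 * recvcount and one isend per non-zero sendcount, followed by a single
 * wait for all of them. Returns the number of requests posted. */
size_t linear_requests(const int *sendcounts, const int *recvcounts,
                       int comm_size)
{
    size_t posted = 0;
    for (int peer = 0; peer < comm_size; ++peer) {
        if (recvcounts[peer] > 0) ++posted;  /* zero-size receives skipped */
        if (sendcounts[peer] > 0) ++posted;  /* zero-size sends skipped */
    }
    return posted;
}

/* Cost of the pairwise model for one rank. */
typedef struct {
    size_t waits;        /* one blocking wait per step */
    size_t empty_waits;  /* steps that moved no data in either direction */
} pairwise_cost;

/* Hypothetical model of the pairwise scheme: comm_size steps, each one
 * exchanging with a pair of peers and waiting before the next step.
 * The peer schedule below is an assumption, not taken from the source. */
pairwise_cost pairwise_waits(const int *sendcounts, const int *recvcounts,
                             int rank, int comm_size)
{
    pairwise_cost cost = {0, 0};
    for (int step = 0; step < comm_size; ++step) {
        int sendto   = (rank + step) % comm_size;
        int recvfrom = (rank + comm_size - step) % comm_size;
        ++cost.waits;
        if (sendcounts[sendto] == 0 && recvcounts[recvfrom] == 0)
            ++cost.empty_waits;
    }
    return cost;
}
```

For a rank in an 8-process communicator that sends to one peer and receives from one peer, the linear model posts 2 requests and waits once, while the pairwise model waits 8 times, 7 of them for steps that carry no data.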
Relatedly, how would I direct OpenMPI to use the older algorithm
programmatically? I don't want the user to have to use "--mca" in their
"mpiexec". Is there a C API?
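
One approach I could try, short of a dedicated C API, is to set the OMPI_MCA_* environment variables from inside the program before calling MPI_Init. This is a minimal sketch and rests on an assumption: that each process reads these variables during MPI_Init, in the same way as when they are exported in the shell. The helper name is hypothetical; only the two parameter names come from this thread.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <stdlib.h>

/* Hypothetical helper: ask the tuned collective component for the basic
 * linear alltoallv (algorithm 1). Must run before MPI_Init.
 * Returns 0 on success, -1 if setenv fails. */
int request_basic_linear_alltoallv(void)
{
    /* overwrite = 0 keeps any value the user has already exported,
     * so "--mca" or the environment can still override this default. */
    if (setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 0) != 0)
        return -1;
    if (setenv("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1", 0) != 0)
        return -1;
    return 0;
}
```

The intended use would be to call request_basic_linear_alltoallv() as the first statement in main(), immediately before MPI_Init(&argc, &argv). Whether the setting is honoured is something I would still need to confirm with a benchmark.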
On 16/11/12 10:15, Iliev, Hristo wrote:
> Hi, Simon,
> The pairwise algorithm passes messages in a synchronised ring-like fashion
> with increasing stride, so it works best when independent communication
> paths could be established between several ports of the network
> switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
> some is - it depends (usually on the price). This said, not all algorithms
> perform the same given a specific type of network interconnect. For example,
> on our fat-tree InfiniBand network the pairwise algorithm performs better.
> You can switch back to the basic linear algorithm by providing the following
> MCA parameters:
> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...
> Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
> is the pairwise one.
> You can also set these values as exported environment variables:
> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
> mpiexec ...
> You can also put this in $HOME/.openmpi/mca-params.conf or (to make it have
> global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
> coll_tuned_use_dynamic_rules = 1
> coll_tuned_alltoallv_algorithm = 1
> A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
> activate process binding with --bind-to-core if you haven't already done so.
> It prevents MPI processes from being migrated to other NUMA nodes while
> running.
> Kind regards,
> Hristo Iliev, Ph.D. -- High Performance Computing
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23, D 52074 Aachen (Germany)
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On Behalf Of Number Cruncher
>> Sent: Thursday, November 15, 2012 5:37 PM
>> To: Open MPI Users
>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>> I've noticed a very significant (100%) slow down for MPI_Alltoallv calls as of
>> version 1.6.1.
>> * This is most noticeable for high-frequency exchanges over 1Gb ethernet
>> where process-to-process message sizes are fairly small (e.g. 100kbyte) and
>> much of the exchange matrix is sparse.
>> * 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
>> to a pairwise exchange", but I'm not clear what this means or how to switch
>> back to the old "non-default algorithm".
>> I attach a test program which illustrates the sort of usage in our MPI
>> application. I have run this as 32 processes on four nodes, over 1Gb
>> ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,..
>> on node 1, rank 1,5,9, ... on node 2, etc.
>> It constructs an array of integers and a nProcess x nProcess exchange typical
>> of part of our application. This is then exchanged several thousand times.
>> Output from "mpicc -O3" runs shown below.
>> My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also
>> attach a plot showing network throughput on our actual mesh generation
>> application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes.
>> Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an
>> hour to run. There seems to be a much greater network demand in the 1.6.1
>> version, despite the user-code and input data being identical.
>> Thanks for any help you can give,