I completely understand there's no "one
size fits all", and I appreciate that there are workarounds to the
change in behaviour. I'm only trying to make what little
contribution I can by questioning the architecture of the pairwise
I know that for every user you please, there will be some that
aren't happy when a default changes (Windows 8 anyone?), but I'm
trying to provide some real-world data. If 90% of apps are 10%
faster and 10% are 1000% slower, should the default change?
all_to_all is a really nice primitive from a developer point of
view. Every process' code is symmetric and identical. Maybe I
should have to worry that most of the matrix exchange is sparse; I
probably could calculate an optimal exchange pattern. But I think
this is the implementation's job, and I will continue to argue
that *waiting* for each pairwise exchange is (a) unnecessary, (b)
doesn't improve performance for *any* application and (c) at worst
causes huge slowdown over the last algorithm for sparse cases.
In summary: I'm arguing that there's a problem with the pairwise
implementation as it stands. It doesn't have any optimization for
sparse all_to_all and imposes unnecessary synchronisation barriers
in all cases.
On 20/12/2012 14:42, Iliev, Hristo wrote:
The goal of any MPI implementation is to be as fast as possible.
Unfortunately there is no "one size fits all" algorithm that works on all
networks and given all possible kind of peculiarities that your specific
communication scheme may have. That's why there are different algorithms and
you are given the option to dynamically select them at run time without the
need to recompile the code. I don't think the change of the default
algorithm (note that the pairwise algorithm has been there for many years -
it is not new, simply the new default one) was introduced in order to piss
If you want OMPI to default to the previous algorithm:
1) Add this to the system-wide OMPI configuration file
$sysconf/openmpi-mca-params.conf (wher $sysconf would most likely be
$PREFIX/etc, where $PREFIX is the OMPI installation directory):
coll_tuned_use_dynamic_rules = 1
coll_tuned_alltoallv_algorithm = 1
2) The settings from (1) can be overridden on per user basis by the similar
settings from $HOME/.openmpi/mca-params.conf.
3) The settings from (1) and (2) can be overridden on per job basis by
exporting MCA parameters as environment variables:
4) Finally, the settings from (1), (2), and (3) can be overridden on per MPI
program launch by supplying appropriate MCA parameters to orterun (a.k.a.
mpirun and mpiexec).
There is also a largely undocumented feature of the "tuned" collective
component where a dynamic rules file can be supplied. In the file a series
of cases tell the library which implementation to use based on the
communicator and message sizes. No idea if it works with ALLTOALLV.
(sorry for top posting - damn you, Outlook!)
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
From: email@example.com [mailto:firstname.lastname@example.org]
On Behalf Of Number Cruncher
Sent: Wednesday, December 19, 2012 5:31 PM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
On 19/12/12 11:08, Paul Kapinos wrote:
Did you *really* wanna to dig into code just in order to switch a
default communication algorithm?
No, I didn't want to, but with a huge change in performance, I'm forced to
something! And having looked at the different algorithms, I think there's
problem with the new default whenever message sizes are small enough
that connection latency will dominate. We're not all running Infiniband,
having to wait for each pairwise exchange to complete before initiating
another seems wrong if the latency in waiting for completion dominates the
E.g. If I have 10 small pairwise exchanges to perform,isn't it better to
10 outbound messages on the wire, and wait for 10 matching inbound
messages, in any order? The new algorithm must wait for first exchange to
complete, then second exchange, then third. Unlike before, does it not
to wait to acknowledge the matching *zero-sized* request? I don't see why
this temporal ordering matters.
Thanks for your help,
Note there are several ways to set the parameters; --mca on command
line is just one of them (suitable for quick online tests).
We 'tune' our Open MPI by setting environment variables....
On 12/19/12 11:44, Number Cruncher wrote:
Having run some more benchmarks, the new default is *really* bad for
our application (2-10x slower), so I've been looking at the source to
try and figure out why.
It seems that the biggest difference will occur when the all_to_all
is actually sparse (e.g. our application); if most N-M process
exchanges are zero in size the 1.6
ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
only post irecv/isend for non-zero exchanges; any zero-size exchanges
are skipped. It then waits once for all requests to complete. In
contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise will post
the zero-size exchanges for
*every* N-M pair, and wait for each pairwise exchange. This is
waits, may of which are zero. I'm not clear what optimizations there
are for zero-size isend/irecv, but surely there's a great deal more
latency if each pairwise exchange has to be confirmed complete before
executing the next?
Relatedly, how would I direct OpenMPI to use the older algorithm
programmatically? I don't want the user to have to use "--mca" in
their "mpiexec". Is there a C API?
On 16/11/12 10:15, Iliev, Hristo wrote:
The pairwise algorithm passes messages in a synchronised ring-like
fashion with increasing stride, so it works best when independent
communication paths could be established between several ports of
the network switch/router. Some 1 Gbps Ethernet equipment is not
capable of doing so, some is - it depends (usually on the price).
This said, not all algorithms perform the same given a specific type
of network interconnect. For example, on our fat-tree InfiniBand
network the pairwise algorithm performs better.
You can switch back to the basic linear algorithm by providing the
following MCA parameters:
mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...
Algorithm 1 is the basic linear, which used to be the default.
is the pairwise one.
You can also set these values as exported environment variables:
You can also put this in $HOME/.openmpi/mcaparams.conf or (to make
it have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense
to activate process binding with --bind-to-core if you haven't
already did so.
It prevents MPI processes from being migrated to other NUMA nodes
Hristo Iliev, Ph.D. -- High Performance Computing RWTH Aachen
University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen Seffenter Weg
D 52074 Aachen (Germany)
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
I've noticed a very significant (100%) slow down for MPI_Alltoallv
* This is most noticeable for high-frequency exchanges over 1Gb
ethernet where process-to-process message sizes are fairly small
much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
algorithm to a pairwise exchange", but I'm not clear what this
means or how to
back to the old "non-default algorithm".
I attach a test program which illustrates the sort of usage in our
MPI application. I have run as this as 32 processes on four nodes,
over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core);
on node 1, rank 1,5,9, ... on node 2, etc.
It constructs an array of integers and a nProcess x nProcess
of part of our application. This is then exchanged several thousand
Output from "mpicc -O3" runs shown below.
My guess is that 1.6.1 is hitting additional latency not present in
attach a plot showing network throughput on our actual mesh
generation application. Nodes cfsc01-04 are running 1.6.0 and
finish within 35
Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and
hour to run. There seems to be a much greater network demand in the
version, despite the user-code and input data being identical.
Thanks for any help you can give,
users mailing list
users mailing list
users mailing list