Did you *really* wanna dig into code just in order to switch a default
Note there are several ways to set the parameters; --mca on the command line is
just one of them (suitable for quick online tests).
We 'tune' our Open MPI by setting environment variables....
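For an application that should carry these settings itself rather than rely on the user's shell, a minimal sketch in C follows. It assumes that Open MPI picks up OMPI_MCA_* variables from the process environment during MPI_Init, which is my understanding and not something stated in this thread. The helper names mca_default and prefer_basic_linear_alltoallv are hypothetical; only setenv (POSIX) and the two parameter names quoted further down are real.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* setenv() is POSIX, not ISO C */
#endif
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: set OMPI_MCA_<param>=<value> unless the variable is
 * already present in the environment. Returns 0 on success, -1 on failure. */
static int mca_default(const char *param, const char *value)
{
    char name[128];
    int n = snprintf(name, sizeof name, "OMPI_MCA_%s", param);
    if (n < 0 || (size_t)n >= sizeof name)
        return -1;                  /* parameter name too long */
    return setenv(name, value, 0);  /* 0: do not overwrite a user setting */
}

/* Hypothetical helper: request the basic linear alltoallv (algorithm 1).
 * Assumption: this must run before MPI_Init() to have any effect. */
static int prefer_basic_linear_alltoallv(void)
{
    if (mca_default("coll_tuned_use_dynamic_rules", "1") != 0)
        return -1;
    return mca_default("coll_tuned_alltoallv_algorithm", "1");
}
```

Every rank would call prefer_basic_linear_alltoallv() at the top of main, before MPI_Init(&argc, &argv). Because setenv is called with overwrite set to 0, a value the user exported beforehand still wins, so this behaves as an application default rather than a hard override.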
On 12/19/12 11:44, Number Cruncher wrote:
> Having run some more benchmarks, the new default is *really* bad for our
> application (2-10x slower), so I've been looking at the source to try and figure
> out why.
> It seems that the biggest difference will occur when the all_to_all is actually
> sparse (e.g. our application); if most N-M process exchanges are zero in size
> the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
> only post irecv/isend for non-zero exchanges; any zero-size exchanges are
> skipped. It then waits once for all requests to complete. In contrast, the new
> ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size exchanges for
> *every* N-M pair, and wait for each pairwise exchange. This is O(comm_size)
> waits, many of which are zero. I'm not clear what optimizations there are for
> zero-size isend/irecv, but surely there's a great deal more latency if each
> pairwise exchange has to be confirmed complete before executing the next?
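The difference described in the paragraph above can be put into numbers with a toy cost model. This is only a sketch of that description, not the Open MPI source: it counts, for one rank, how many non-blocking requests get posted and how many times the rank blocks waiting. The type exchange_cost and both function names are hypothetical.

```c
/* Per-rank tally: non-blocking requests posted and blocking waits issued. */
typedef struct {
    int requests;
    int waits;
} exchange_cost;

/* Basic linear, as described above: post an irecv/isend only for peers
 * with a non-zero count, then wait once for all of them. */
static exchange_cost linear_cost(int comm_size, const int *sendcounts,
                                 const int *recvcounts)
{
    exchange_cost c = {0, 0};
    for (int peer = 0; peer < comm_size; ++peer) {
        if (recvcounts[peer] > 0)
            c.requests++;
        if (sendcounts[peer] > 0)
            c.requests++;
    }
    c.waits = 1;  /* a single wait covering every posted request */
    return c;
}

/* Pairwise, as described above: one step per peer; each step posts a
 * receive and a send (zero-size or not) and waits before the next step. */
static exchange_cost pairwise_cost(int comm_size)
{
    exchange_cost c;
    c.requests = 2 * comm_size;
    c.waits = comm_size;
    return c;
}
```

Under this model, a rank in a 32-process job that exchanges data with only three peers in each direction posts 6 requests and waits once with the linear scheme, against 64 requests and 32 waits with the pairwise scheme. The model ignores message size and network contention entirely, so it only illustrates the latency argument, not the bandwidth one.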
> Relatedly, how would I direct OpenMPI to use the older algorithm
> programmatically? I don't want the user to have to use "--mca" in their
> "mpiexec". Is there a C API?
> On 16/11/12 10:15, Iliev, Hristo wrote:
>> Hi, Simon,
>> The pairwise algorithm passes messages in a synchronised ring-like fashion
>> with increasing stride, so it works best when independent communication
>> paths could be established between several ports of the network
>> switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
>> some is - it depends (usually on the price). This said, not all algorithms
>> perform the same given a specific type of network interconnect. For example,
>> on our fat-tree InfiniBand network the pairwise algorithm performs better.
>> You can switch back to the basic linear algorithm by providing the following
>> MCA parameters:
>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...
>> Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
>> is the pairwise one.
>> You can also set these values as exported environment variables:
>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>> mpiexec ...
>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make it have
>> global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>> coll_tuned_use_dynamic_rules=1
>> coll_tuned_alltoallv_algorithm=1
>> A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
>> activate process binding with --bind-to-core if you haven't already done so.
>> It prevents MPI processes from being migrated to other NUMA nodes while
>> Kind regards,
>> Hristo Iliev, Ph.D. -- High Performance Computing
>> RWTH Aachen University, Center for Computing and Communication
>> Rechen- und Kommunikationszentrum der RWTH Aachen
>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>> -----Original Message-----
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>> On Behalf Of Number Cruncher
>>> Sent: Thursday, November 15, 2012 5:37 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of
>>> version 1.6.1.
>>> * This is most noticeable for high-frequency exchanges over 1Gb ethernet
>>> where process-to-process message sizes are fairly small (e.g. 100 kbyte) and
>>> much of the exchange matrix is sparse.
>>> * 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
>>> to a pairwise exchange", but I'm not clear what this means or how to switch
>>> back to the old "non-default algorithm".
>>> I attach a test program which illustrates the sort of usage in our MPI
>>> application. I have run this as 32 processes on four nodes, over 1Gb
>>> ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,..
>>> on node 1, rank 1,5,9, ... on node 2, etc.
>>> It constructs an array of integers and a nProcess x nProcess exchange
>>> of part of our application. This is then exchanged several thousand times.
>>> Output from "mpicc -O3" runs shown below.
>>> My guess is that 1.6.1 is hitting additional latency not present in 1.6.0.
>>> I also attach a plot showing network throughput on our actual mesh generation
>>> application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes.
>>> Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an
>>> hour to run. There seems to be a much greater network demand in the 1.6.1
>>> version, despite the user-code and input data being identical.
>>> Thanks for any help you can give,
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915