Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
From: Number Cruncher (number.cruncher_at_[hidden])
Date: 2012-12-21 07:25:44


I completely understand there's no "one size fits all", and I appreciate
that there are workarounds to the change in behaviour. I'm only trying
to make what little contribution I can by questioning the architecture
of the pairwise algorithm.

I know that for every user you please, there will be some that aren't
happy when a default changes (Windows 8 anyone?), but I'm trying to
provide some real-world data. If 90% of apps are 10% faster and 10% are
1000% slower, should the default change?

all_to_all is a really nice primitive from a developer's point of view.
Every process's code is symmetric and identical. Maybe I should have to
worry that most of the exchange matrix is sparse; I probably could
calculate an optimal exchange pattern myself. But I think this is the
implementation's job, and I will continue to argue that *waiting* for
each pairwise exchange is (a) unnecessary, (b) doesn't improve
performance for *any* application, and (c) at worst causes a huge
slowdown over the previous algorithm for sparse cases.
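
For concreteness, here's the shape of what I'm arguing for, written as
user code rather than library internals (an untested sketch:
sparse_alltoallv is a made-up name, and MPI_INT stands in for the real
payload type):

#include <mpi.h>
#include <stdlib.h>

/* Post isend/irecv only for the non-empty exchanges, then wait once
 * for all of them; arguments mirror MPI_Alltoallv. */
static void sparse_alltoallv(int *sbuf, int *sendcounts, int *sdispls,
                             int *rbuf, int *recvcounts, int *rdispls,
                             MPI_Comm comm)
{
    int size, nreq = 0, tag = 0;
    MPI_Comm_size(comm, &size);
    MPI_Request *reqs = malloc(2 * size * sizeof(*reqs));

    for (int p = 0; p < size; ++p)      /* zero-size receives skipped */
        if (recvcounts[p] > 0)
            MPI_Irecv(rbuf + rdispls[p], recvcounts[p], MPI_INT,
                      p, tag, comm, &reqs[nreq++]);
    for (int p = 0; p < size; ++p)      /* zero-size sends skipped */
        if (sendcounts[p] > 0)
            MPI_Isend(sbuf + sdispls[p], sendcounts[p], MPI_INT,
                      p, tag, comm, &reqs[nreq++]);

    /* One wait for everything; completions may arrive in any order. */
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

Nothing in there blocks on any particular peer, and the sparse case
costs only as many requests as there are non-empty exchanges.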

In summary: I'm arguing that there's a problem with the pairwise
implementation as it stands. It doesn't have any optimization for sparse
all_to_all and imposes unnecessary synchronisation barriers in all cases.

Simon

On 20/12/2012 14:42, Iliev, Hristo wrote:
> Simon,
>
> The goal of any MPI implementation is to be as fast as possible.
> Unfortunately there is no "one size fits all" algorithm that works on all
> networks and with all possible kinds of peculiarities that your specific
> communication scheme may have. That's why there are different algorithms and
> you are given the option to dynamically select them at run time without the
> need to recompile the code. I don't think the change of the default
> algorithm (note that the pairwise algorithm has been there for many years -
> it is not new, simply the new default one) was introduced in order to piss
> users off.
>
> If you want OMPI to default to the previous algorithm:
>
> 1) Add this to the system-wide OMPI configuration file
> $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be
> $PREFIX/etc, where $PREFIX is the OMPI installation directory):
> coll_tuned_use_dynamic_rules = 1
> coll_tuned_alltoallv_algorithm = 1
>
> 2) The settings from (1) can be overridden on a per-user basis by similar
> settings from $HOME/.openmpi/mca-params.conf.
>
> 3) The settings from (1) and (2) can be overridden on a per-job basis by
> exporting MCA parameters as environment variables:
> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>
> 4) Finally, the settings from (1), (2), and (3) can be overridden for an
> individual MPI program launch by supplying the appropriate MCA parameters
> to orterun (a.k.a. mpirun and mpiexec), as in the example below.
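>
> For instance (./my_app being a placeholder for your actual program):
>
> mpirun --mca coll_tuned_use_dynamic_rules 1 \
>        --mca coll_tuned_alltoallv_algorithm 1 ./my_app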
>
> There is also a largely undocumented feature of the "tuned" collective
> component whereby a dynamic rules file can be supplied. In that file, a
> series of cases tells the library which implementation to use based on
> the communicator and message sizes. No idea if it works with ALLTOALLV.
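>
> If I recall the parameter name correctly, the file is pointed to by the
> coll_tuned_dynamic_rules_filename MCA parameter, e.g. (rules.conf being
> whatever file you have written):
>
> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
> export OMPI_MCA_coll_tuned_dynamic_rules_filename=$HOME/rules.conf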
>
> Kind regards,
> Hristo
>
> (sorry for top posting - damn you, Outlook!)
> --
> Hristo Iliev, Ph.D. -- High Performance Computing
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23, D 52074 Aachen (Germany)
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On Behalf Of Number Cruncher
>> Sent: Wednesday, December 19, 2012 5:31 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>> 1.6.1
>>
>> On 19/12/12 11:08, Paul Kapinos wrote:
>>> Did you *really* wanna dig into the code just in order to switch a
>>> default communication algorithm?
>> No, I didn't want to, but with a huge change in performance, I'm forced
>> to do something! And having looked at the different algorithms, I think
>> there's a problem with the new default whenever message sizes are small
>> enough that connection latency will dominate. We're not all running
>> InfiniBand, and having to wait for each pairwise exchange to complete
>> before initiating another seems wrong if the latency in waiting for
>> completion dominates the transmission time.
>>
>> E.g. if I have 10 small pairwise exchanges to perform, isn't it better
>> to put all 10 outbound messages on the wire and wait for the 10 matching
>> inbound messages, in any order? The new algorithm must wait for the
>> first exchange to complete, then the second, then the third. Unlike
>> before, doesn't it have to wait to acknowledge even the matching
>> *zero-sized* requests? I don't see why this temporal ordering matters.
>>
>> Thanks for your help,
>> Simon
>>
>>
>>
>>
>>> Note there are several ways to set the parameters; --mca on command
>>> line is just one of them (suitable for quick online tests).
>>>
>>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>>
>>> We 'tune' our Open MPI by setting environment variables....
>>>
>>> Best
>>> Paul Kapinos
>>>
>>>
>>>
>>> On 12/19/12 11:44, Number Cruncher wrote:
>>>> Having run some more benchmarks, the new default is *really* bad for
>>>> our application (2-10x slower), so I've been looking at the source to
>>>> try and figure out why.
>>>>
>>>> It seems that the biggest difference will occur when the all_to_all
>>>> is actually sparse (e.g. our application); if most N-M process
>>>> exchanges are zero in size, the 1.6.0
>>>> ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will only post
>>>> irecv/isend for non-zero exchanges; any zero-size exchanges are
>>>> skipped. It then waits once for all requests to complete. In
>>>> contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise will post
>>>> the zero-size exchanges for *every* N-M pair, and wait for each
>>>> pairwise exchange in turn. This is O(comm_size) waits, many of which
>>>> are for zero-size messages. I'm not clear what optimizations there
>>>> are for zero-size isend/irecv, but surely there's a great deal more
>>>> latency if each pairwise exchange has to be confirmed complete before
>>>> executing the next?
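>>>>
>>>> In pseudo-C, the structure I'm describing looks roughly like this (my
>>>> paraphrase of the pairwise algorithm, not the actual source; untested,
>>>> with alltoallv-style arguments and rank/size/tag/comm assumed in
>>>> scope):
>>>>
>>>> /* One partner per step; each step blocks until that pairwise
>>>>  * exchange completes, even when both counts are zero. */
>>>> for (int step = 0; step < size; ++step) {
>>>>     int sendto   = (rank + step) % size;
>>>>     int recvfrom = (rank + size - step) % size;
>>>>     MPI_Sendrecv(sbuf + sdispls[sendto], sendcounts[sendto],
>>>>                  MPI_INT, sendto, tag,
>>>>                  rbuf + rdispls[recvfrom], recvcounts[recvfrom],
>>>>                  MPI_INT, recvfrom, tag, comm, MPI_STATUS_IGNORE);
>>>> }   /* O(comm_size) sequential completions, sparse or not */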
>>>>
>>>> Relatedly, how would I direct Open MPI to use the older algorithm
>>>> programmatically? I don't want the user to have to use "--mca" in
>>>> their "mpiexec". Is there a C API?
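>>>>
>>>> The closest thing I can think of is having the program export the MCA
>>>> parameters itself before MPI_Init() runs (an untested idea using POSIX
>>>> setenv, not an official Open MPI interface):
>>>>
>>>> /* requires <stdlib.h>; must run before MPI_Init() so the MCA
>>>>  * system picks the values up */
>>>> setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1);
>>>> setenv("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1", 1);
>>>> MPI_Init(&argc, &argv);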
>>>>
>>>> Thanks,
>>>> Simon
>>>>
>>>>
>>>> On 16/11/12 10:15, Iliev, Hristo wrote:
>>>>> Hi, Simon,
>>>>>
>>>>> The pairwise algorithm passes messages in a synchronised ring-like
>>>>> fashion with increasing stride, so it works best when independent
>>>>> communication paths can be established between several ports of
>>>>> the network switch/router. Some 1 Gbps Ethernet equipment is not
>>>>> capable of doing so, some is - it depends (usually on the price).
>>>>> That said, not all algorithms perform the same on a given type of
>>>>> network interconnect. For example, on our fat-tree InfiniBand
>>>>> network the pairwise algorithm performs better.
>>>>>
>>>>> You can switch back to the basic linear algorithm by providing the
>>>>> following MCA parameters:
>>>>>
>>>>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>> coll_tuned_alltoallv_algorithm 1 ...
>>>>>
>>>>> Algorithm 1 is the basic linear one, which used to be the default.
>>>>> Algorithm 2 is the pairwise one.
>>>>> You can also set these values as exported environment variables:
>>>>>
>>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>> mpiexec ...
>>>>>
>>>>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make
>>>>> it have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>>>>>
>>>>> coll_tuned_use_dynamic_rules=1
>>>>> coll_tuned_alltoallv_algorithm=1
>>>>>
>>>>> A gratuitous hint: dual-Opteron systems are NUMA machines, so it
>>>>> makes sense to activate process binding with --bind-to-core if you
>>>>> haven't already done so. It prevents MPI processes from being
>>>>> migrated to other NUMA nodes while running.
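>>>>>
>>>>> For example (your_app standing in for the real binary):
>>>>>
>>>>> mpiexec --bind-to-core -np 32 ./your_app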
>>>>>
>>>>> Kind regards,
>>>>> Hristo
>>>>> --
>>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: users-bounces_at_[hidden]
>>>>>> [mailto:users-bounces_at_[hidden]]
>>>>>> On Behalf Of Number Cruncher
>>>>>> Sent: Thursday, November 15, 2012 5:37 PM
>>>>>> To: Open MPI Users
>>>>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>>>>>> 1.6.1
>>>>>>
>>>>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv
>>>>>> calls as of version 1.6.1.
>>>>>> * This is most noticeable for high-frequency exchanges over 1Gb
>>>>>> Ethernet where process-to-process message sizes are fairly small
>>>>>> (e.g. 100 kbyte) and much of the exchange matrix is sparse.
>>>>>> * The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
>>>>>> algorithm to a pairwise exchange", but I'm not clear what this
>>>>>> means or how to switch back to the old "non-default algorithm".
>>>>>>
>>>>>> I attach a test program which illustrates the sort of usage in our
>>>>>> MPI application. I have run this as 32 processes on four nodes,
>>>>>> over 1Gb Ethernet, each node with 2x Opteron 4180 (dual hex-core);
>>>>>> ranks 0,4,8,... on node 1, ranks 1,5,9,... on node 2, etc.
>>>>>>
>>>>>> It constructs an array of integers and an nProcess x nProcess
>>>>>> exchange typical of part of our application. This is then exchanged
>>>>>> several thousand times. Output from "mpicc -O3" runs is shown below.
>>>>>>
>>>>>> My guess is that 1.6.1 is hitting additional latency not present in
>>>>>> 1.6.0. I also attach a plot showing network throughput on our
>>>>>> actual mesh generation application. Nodes cfsc01-04 are running
>>>>>> 1.6.0 and finish within 35 minutes. Nodes cfsc05-08 are running
>>>>>> 1.6.2 (started 10 minutes later) and take over an hour to run.
>>>>>> There seems to be a much greater network demand in the 1.6.1
>>>>>> version, despite the user code and input data being identical.
>>>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>