Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
From: Number Cruncher (number.cruncher_at_[hidden])
Date: 2012-12-22 06:47:26

Thanks for the informative response. What I'm still not clear about is
whether there is a very simple optimization for the zero-size case.
If two processes know they aren't exchanging *any* data (which is known
from the argument list for all_to_allv), is there still any network
latency or overhead in the sendrecv exchange for this empty pair? The
previous algorithm just skipped this case; couldn't the pairwise one also?
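To make the question concrete, here is a minimal sketch in plain C (no MPI calls) of the bookkeeping a rank could do if the pairwise schedule skipped empty transfers. It is not the Open MPI implementation. The names `pairwise_plan` and `plan_pairwise_skip_empty` are hypothetical, and the peer ordering (send to rank + step, receive from rank - step) is an assumption about how a pairwise schedule is laid out. The skip relies only on a rank's own sendcounts/recvcounts, on the understanding that MPI requires the matching count on the peer to agree.

```c
#include <assert.h>

/* Result of walking one rank's pairwise schedule (hypothetical type). */
typedef struct {
    int sends_posted;   /* non-empty sends this rank would post */
    int recvs_posted;   /* non-empty receives this rank would post */
    int steps_skipped;  /* steps where both directions are empty */
} pairwise_plan;

/*
 * Walk the pairwise schedule for `rank` in a communicator of `size` ranks.
 * sendcounts[j] and recvcounts[j] are this rank's own MPI_Alltoallv
 * arguments. Assumed ordering: at step s, send to (rank + s) % size and
 * receive from (rank - s + size) % size. Only remote peers are visited.
 */
pairwise_plan plan_pairwise_skip_empty(int rank, int size,
                                       const int *sendcounts,
                                       const int *recvcounts)
{
    pairwise_plan plan = {0, 0, 0};
    assert(size > 0 && rank >= 0 && rank < size);

    for (int step = 1; step < size; ++step) {
        int sendto = (rank + step) % size;
        int recvfrom = (rank - step + size) % size;
        int do_send = sendcounts[sendto] > 0;
        int do_recv = recvcounts[recvfrom] > 0;

        plan.sends_posted += do_send;
        plan.recvs_posted += do_recv;
        if (!do_send && !do_recv) {
            plan.steps_skipped++;  /* nothing to wait for at this step */
        }
    }
    return plan;
}
```

For rank 0 of 4 with sendcounts {0, 5, 0, 0} and recvcounts {0, 0, 0, 7}, only the first step carries data, so two of the three steps would need no wait at all.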


On 21/12/2012 18:53, George Bosilca wrote:
> I can argue the opposite: in the most general case, each process will
> exchange data with all other processes, so a blocking approach as
> implemented in the current version makes sense. As you noticed, this
> leads to poor results when the exchange pattern is sparse. We took what
> we believed is the most common usage of the alltoallv collective, and
> provided a default algorithm we consider the best for it (pairwise due
> to a tightly coupled structure of communications).
> However, as one of the main developers of the collective module, I'm
> not insensitive to your argument. I would have loved to be able to
> alter the behavior of the alltoallv to adapt more carefully to the
> collective pattern itself. Unfortunately, it is very difficult as the
> alltoallv is one of the few collectives where the knowledge about the
> communication pattern is not evenly distributed among the peers (every
> rank knows only about the communications where it is involved). Thus,
> without requiring extra communications, the only valid parameter which
> can affect the selection of one of the underlying implementations is
> the number of participants in the collective (not even the number of
> participants exchanging real data, but the number of participants in
> the communicator). That is not enough to make a smarter decision.
> As suggested several times already in this thread, there are quite a
> few MCA parameters that allow specialized behaviors for applications
> with communication patterns we did not consider mainstream. You
> should definitely take advantage of these to further optimize your
> applications.
> George.
> On Dec 21, 2012, at 13:25, Number Cruncher
> <number.cruncher_at_[hidden]> wrote:
>> I completely understand there's no "one size fits all", and I
>> appreciate that there are workarounds to the change in behaviour. I'm
>> only trying to make what little contribution I can by questioning the
>> architecture of the pairwise algorithm.
>> I know that for every user you please, there will be some that aren't
>> happy when a default changes (Windows 8 anyone?), but I'm trying to
>> provide some real-world data. If 90% of apps are 10% faster and 10%
>> are 1000% slower, should the default change?
>> all_to_all is a really nice primitive from a developer's point of view.
>> Every process's code is symmetric and identical. Maybe I should
>> worry that most of the exchange matrix is sparse; I probably could
>> calculate an optimal exchange pattern. But I think this is the
>> implementation's job, and I will continue to argue that *waiting* for
>> each pairwise exchange is (a) unnecessary, (b) doesn't improve
>> performance for *any* application and (c) at worst causes huge
>> slowdown over the last algorithm for sparse cases.
>> In summary: I'm arguing that there's a problem with the pairwise
>> implementation as it stands. It doesn't have any optimization for
>> sparse all_to_all and imposes unnecessary synchronisation barriers in
>> all cases.
>> Simon
>> On 20/12/2012 14:42, Iliev, Hristo wrote:
>>> Simon,
>>> The goal of any MPI implementation is to be as fast as possible.
>>> Unfortunately there is no "one size fits all" algorithm that works on all
>>> networks and with all the peculiarities that your specific
>>> communication scheme may have. That's why there are different algorithms and
>>> you are given the option to dynamically select them at run time without the
>>> need to recompile the code. I don't think the change of the default
>>> algorithm (note that the pairwise algorithm has been there for many years -
>>> it is not new, simply the new default one) was introduced in order to piss
>>> users off.
>>> If you want OMPI to default to the previous algorithm:
>>> 1) Add this to the system-wide OMPI configuration file
>>> $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be
>>> $PREFIX/etc, where $PREFIX is the OMPI installation directory):
>>> coll_tuned_use_dynamic_rules = 1
>>> coll_tuned_alltoallv_algorithm = 1
>>> 2) The settings from (1) can be overridden on a per-user basis by the similar
>>> settings from $HOME/.openmpi/mca-params.conf.
>>> 3) The settings from (1) and (2) can be overridden on a per-job basis by
>>> exporting MCA parameters as environment variables:
>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>> 4) Finally, the settings from (1), (2), and (3) can be overridden for each MPI
>>> program launch by supplying appropriate MCA parameters to orterun (a.k.a.
>>> mpirun and mpiexec).
>>> There is also a largely undocumented feature of the "tuned" collective
>>> component where a dynamic rules file can be supplied. In the file a series
>>> of cases tell the library which implementation to use based on the
>>> communicator and message sizes. No idea if it works with ALLTOALLV.
>>> Kind regards,
>>> Hristo
>>> (sorry for top posting - damn you, Outlook!)
>>> --
>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>> RWTH Aachen University, Center for Computing and Communication
>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>> -----Original Message-----
>>>> From:users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>> On Behalf Of Number Cruncher
>>>> Sent: Wednesday, December 19, 2012 5:31 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>>>> 1.6.1
>>>> On 19/12/12 11:08, Paul Kapinos wrote:
>>>>> Did you *really* want to dig into the code just in order to switch a
>>>>> default communication algorithm?
>>>> No, I didn't want to, but with a huge change in performance, I'm forced to
>>>> do something! And having looked at the different algorithms, I think
>>>> there's a problem with the new default whenever message sizes are small
>>>> enough that connection latency will dominate. We're not all running
>>>> Infiniband, and having to wait for each pairwise exchange to complete
>>>> before initiating another seems wrong if the latency in waiting for
>>>> completion dominates the transmission time.
>>>> E.g. If I have 10 small pairwise exchanges to perform, isn't it better to
>>>> put all 10 outbound messages on the wire, and wait for 10 matching inbound
>>>> messages, in any order? The new algorithm must wait for the first exchange
>>>> to complete, then the second exchange, then the third. Unlike before, does
>>>> it not have to wait to acknowledge the matching *zero-sized* request? I
>>>> don't see why this temporal ordering matters.
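The "10 small exchanges" argument can be put into numbers with a deliberately crude cost model, sketched below in plain C. The function names are hypothetical, and the model is an assumption, not a measurement: each exchange costs one fixed latency plus a transfer time, latencies overlap fully when everything is posted at once, and payloads still share one link so their transfer times add up.

```c
#include <assert.h>

/* Estimated time when each exchange must complete before the next starts. */
double serialized_time_us(int n_exchanges, double latency_us,
                          double transfer_us)
{
    assert(n_exchanges >= 0);
    return n_exchanges * (latency_us + transfer_us);
}

/*
 * Estimated time when all exchanges are posted at once: the latencies
 * overlap, but the payloads are assumed to share a single link, so the
 * transfer times are still summed.
 */
double overlapped_time_us(int n_exchanges, double latency_us,
                          double transfer_us)
{
    assert(n_exchanges >= 0);
    if (n_exchanges == 0) {
        return 0.0;
    }
    return latency_us + n_exchanges * transfer_us;
}
```

With illustrative values of 10 exchanges, 50 us latency and 1 us transfer time, the serialized estimate is 510 us and the overlapped estimate is 60 us. The gap shrinks as the transfer time grows relative to the latency, which is the regime where waiting per pair costs little.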
>>>> Thanks for your help,
>>>> Simon
>>>>> Note there are several ways to set the parameters; --mca on command
>>>>> line is just one of them (suitable for quick online tests).
>>>>> We 'tune' our Open MPI by setting environment variables....
>>>>> Best
>>>>> Paul Kapinos
>>>>> On 12/19/12 11:44, Number Cruncher wrote:
>>>>>> Having run some more benchmarks, the new default is *really* bad for
>>>>>> our application (2-10x slower), so I've been looking at the source to
>>>>>> try and figure out why.
>>>>>> It seems that the biggest difference will occur when the all_to_all
>>>>>> is actually sparse (e.g. our application); if most N-M process
>>>>>> exchanges are zero in size the 1.6
>>>>>> ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
>>>>>> only post irecv/isend for non-zero exchanges; any zero-size exchanges
>>>>>> are skipped. It then waits once for all requests to complete. In
>>>>>> contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise will post
>>>>>> the zero-size exchanges for *every* N-M pair, and wait for each
>>>>>> pairwise exchange. This is O(comm_size) waits, many of which are for
>>>>>> zero-size exchanges. I'm not clear what optimizations there
>>>>>> are for zero-size isend/irecv, but surely there's a great deal more
>>>>>> latency if each pairwise exchange has to be confirmed complete before
>>>>>> executing the next?
>>>>>> Relatedly, how would I direct OpenMPI to use the older algorithm
>>>>>> programmatically? I don't want the user to have to use "--mca" in
>>>>>> their "mpiexec". Is there a C API?
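One possible way to do this from C, sketched under assumptions: the two parameters are the OMPI_MCA_* environment variables quoted elsewhere in this thread, and the sketch assumes Open MPI reads them from the process environment during MPI_Init, so they are set with POSIX setenv() beforehand. The function names are hypothetical, and this is not a dedicated Open MPI API.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose setenv() under strict C modes */
#endif
#include <assert.h>
#include <stdlib.h>

/* Set name=value only if the user has not already exported it. */
static int set_if_unset(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    return setenv(name, value, 0);  /* 0: do not overwrite */
}

/*
 * Request the basic linear MPI_Alltoallv algorithm (value 1 in this
 * thread). Call this before MPI_Init(); the assumption is that the
 * variables are only consulted during initialization.
 * Returns 0 on success, -1 if a setenv() call failed.
 */
int prefer_basic_linear_alltoallv(void)
{
    if (set_if_unset("OMPI_MCA_coll_tuned_use_dynamic_rules", "1") != 0) {
        return -1;
    }
    if (set_if_unset("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1") != 0) {
        return -1;
    }
    return 0;
}
```

Because existing values are not overwritten, a user who exports a different algorithm number or passes one on the launch command line keeps control; the program only supplies a default.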
>>>>>> Thanks,
>>>>>> Simon
>>>>>> On 16/11/12 10:15, Iliev, Hristo wrote:
>>>>>>> Hi, Simon,
>>>>>>> The pairwise algorithm passes messages in a synchronised ring-like
>>>>>>> fashion with increasing stride, so it works best when independent
>>>>>>> communication paths could be established between several ports of
>>>>>>> the network switch/router. Some 1 Gbps Ethernet equipment is not
>>>>>>> capable of doing so, some is - it depends (usually on the price).
>>>>>>> This said, not all algorithms perform the same given a specific type
>>>>>>> of network interconnect. For example, on our fat-tree InfiniBand
>>>>>>> network the pairwise algorithm performs better.
>>>>>>> You can switch back to the basic linear algorithm by providing the
>>>>>>> following MCA parameters:
>>>>>>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>>>> coll_tuned_alltoallv_algorithm 1 ...
>>>>>>> Algorithm 1 is the basic linear, which used to be the default.
>>>>>>> Algorithm 2
>>>>>>> is the pairwise one.
>>>>>>> You can also set these values as exported environment variables:
>>>>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>>>> mpiexec ...
>>>>>>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make
>>>>>>> it have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>>>>>>> coll_tuned_use_dynamic_rules=1
>>>>>>> coll_tuned_alltoallv_algorithm=1
>>>>>>> A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes
>>>>>>> sense to activate process binding with --bind-to-core if you haven't
>>>>>>> already done so.
>>>>>>> It prevents MPI processes from being migrated to other NUMA nodes
>>>>>>> while running.
>>>>>>> Kind regards,
>>>>>>> Hristo
>>>>>>> --
>>>>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>>> -----Original Message-----
>>>>>>>> From:users-bounces_at_[hidden]
>>>>>>>> [mailto:users-bounces_at_[hidden]]
>>>>>>>> On Behalf Of Number Cruncher
>>>>>>>> Sent: Thursday, November 15, 2012 5:37 PM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>>>>>>>> 1.6.1
>>>>>>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv
>>>>>>>> calls as of version 1.6.1.
>>>>>>>> * This is most noticeable for high-frequency exchanges over 1Gb
>>>>>>>> ethernet where process-to-process message sizes are fairly small
>>>>>>>> (e.g. 100kbyte) and much of the exchange matrix is sparse.
>>>>>>>> * 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
>>>>>>>> algorithm to a pairwise exchange", but I'm not clear what this
>>>>>>>> means or how to switch back to the old "non-default algorithm".
>>>>>>>> I attach a test program which illustrates the sort of usage in our
>>>>>>>> MPI application. I have run this as 32 processes on four nodes,
>>>>>>>> over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core);
>>>>>>>> rank 0,4,8,... on node 1, rank 1,5,9,... on node 2, etc.
>>>>>>>> It constructs an array of integers and a nProcess x nProcess
>>>>>>>> exchange typical of part of our application. This is then exchanged
>>>>>>>> several thousand times.
>>>>>>>> Output from "mpicc -O3" runs shown below.
>>>>>>>> My guess is that 1.6.1 is hitting additional latency not present in
>>>>>>>> 1.6.0. I also attach a plot showing network throughput on our actual
>>>>>>>> mesh generation application. Nodes cfsc01-04 are running 1.6.0 and
>>>>>>>> finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started
>>>>>>>> 10 minutes later) and take over an hour to run. There seems to be a
>>>>>>>> much greater network demand in the 1.6.1 version, despite the
>>>>>>>> user-code and input data being identical.
>>>>>>>> Thanks for any help you can give,
>>>>>>>> Simon
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]