Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
From: Number Cruncher (number.cruncher_at_[hidden])
Date: 2012-12-22 06:47:26

Thanks for the informative response. What I'm still not clear about is
whether there isn't a very simple optimization for the zero-size case.
If two processes know they aren't exchanging *any* data (which is known
from the argument list for all_to_allv), isn't there any network latency
or overhead in the sendrecv exchanges for this zero-exchange? The
previous algorithm just skipped this case; couldn't the pairwise one also?


On 21/12/2012 18:53, George Bosilca wrote:
> I can argue the opposite: in the most general case, each process will
> exchange data with all other processes, so a blocking approach as
> implemented in the current version makes sense. As you noticed, this
> leads to poor results when the exchange pattern is sparse. We took what
> we believed is the most common usage of the alltoallv collective, and
> provided a default algorithm we consider the best for it (pairwise due
> to a tightly coupled structure of communications).
> However, as one of the main developers of the collective module, I'm
> not insensitive to your argument. I would have loved to be able to
> alter the behavior of the alltoallv to adapt more carefully to the
> collective pattern itself. Unfortunately, it is very difficult as the
> alltoallv is one of the few collectives where the knowledge about the
> communication pattern is not evenly distributed among the peers (every
> rank knows only about the communications where it is involved). Thus,
> without requiring extra communications, the only valid parameter which
> can affect the selection of one of the underlying implementations is
> the number of participants in the collective (not even the number of
> participants exchanging real data, but the number of participants in
> the communicator). Not enough to make a smarter decision.
> As suggested several times already in this thread, there are quite a
> few MCA parameters that allow specialized behaviors for applications
> with communication patterns we did not consider mainstream. You
> should definitely take advantage of these to further optimize your
> applications.
> George.
> On Dec 21, 2012, at 13:25, Number Cruncher
> <number.cruncher_at_[hidden]> wrote:
>> I completely understand there's no "one size fits all", and I
>> appreciate that there are workarounds to the change in behaviour. I'm
>> only trying to make what little contribution I can by questioning the
>> architecture of the pairwise algorithm.
>> I know that for every user you please, there will be some that aren't
>> happy when a default changes (Windows 8 anyone?), but I'm trying to
>> provide some real-world data. If 90% of apps are 10% faster and 10%
>> are 1000% slower, should the default change?
>> all_to_all is a really nice primitive from a developer point of view.
>> Every process' code is symmetric and identical. Maybe I should have
>> to worry that most of the matrix exchange is sparse; I probably could
>> calculate an optimal exchange pattern. But I think this is the
>> implementation's job, and I will continue to argue that *waiting* for
>> each pairwise exchange is (a) unnecessary, (b) doesn't improve
>> performance for *any* application and (c) at worst causes huge
>> slowdown over the last algorithm for sparse cases.
>> In summary: I'm arguing that there's a problem with the pairwise
>> implementation as it stands. It doesn't have any optimization for
>> sparse all_to_all and imposes unnecessary synchronisation barriers in
>> all cases.
>> Simon
>> On 20/12/2012 14:42, Iliev, Hristo wrote:
>>> Simon,
>>> The goal of any MPI implementation is to be as fast as possible.
>>> Unfortunately there is no "one size fits all" algorithm that works on all
>>> networks and copes with all the peculiarities that your specific
>>> communication scheme may have. That's why there are different algorithms and
>>> you are given the option to dynamically select them at run time without the
>>> need to recompile the code. I don't think the change of the default
>>> algorithm (note that the pairwise algorithm has been there for many years -
>>> it is not new, simply the new default one) was introduced in order to piss
>>> users off.
>>> If you want OMPI to default to the previous algorithm:
>>> 1) Add this to the system-wide OMPI configuration file
>>> $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be
>>> $PREFIX/etc, where $PREFIX is the OMPI installation directory):
>>> coll_tuned_use_dynamic_rules = 1
>>> coll_tuned_alltoallv_algorithm = 1
>>> 2) The settings from (1) can be overridden on a per-user basis by the
>>> same settings in $HOME/.openmpi/mca-params.conf.
>>> 3) The settings from (1) and (2) can be overridden on a per-job basis by
>>> exporting MCA parameters as environment variables:
>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>> 4) Finally, the settings from (1), (2), and (3) can be overridden for a
>>> single MPI program launch by supplying the appropriate MCA parameters to
>>> orterun (a.k.a. mpirun and mpiexec).
>>> There is also a largely undocumented feature of the "tuned" collective
>>> component where a dynamic rules file can be supplied. In the file a series
>>> of cases tell the library which implementation to use based on the
>>> communicator and message sizes. No idea if it works with ALLTOALLV.
>>> Kind regards,
>>> Hristo
>>> (sorry for top posting - damn you, Outlook!)
>>> --
>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>> RWTH Aachen University, Center for Computing and Communication
>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>> -----Original Message-----
>>>> From:users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>> On Behalf Of Number Cruncher
>>>> Sent: Wednesday, December 19, 2012 5:31 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>>>> 1.6.1
>>>> On 19/12/12 11:08, Paul Kapinos wrote:
>>>>> Did you *really* want to dig into code just in order to switch a
>>>>> default communication algorithm?
>>>> No, I didn't want to, but with a huge change in performance, I'm forced
>>>> to do something! And having looked at the different algorithms, I think
>>>> there's a problem with the new default whenever message sizes are small
>>>> enough that connection latency will dominate. We're not all running
>>>> Infiniband, and having to wait for each pairwise exchange to complete
>>>> before initiating another seems wrong if the latency in waiting for
>>>> completion dominates the transmission time.
>>>> E.g. If I have 10 small pairwise exchanges to perform, isn't it better
>>>> to put all 10 outbound messages on the wire, and wait for 10 matching
>>>> inbound messages, in any order? The new algorithm must wait for first
>>>> exchange to complete, then second exchange, then third. Unlike before,
>>>> does it not have to wait to acknowledge the matching *zero-sized*
>>>> request? I don't see why this temporal ordering matters.
>>>> Thanks for your help,
>>>> Simon
>>>>> Note there are several ways to set the parameters; --mca on command
>>>>> line is just one of them (suitable for quick online tests).
>>>>> We 'tune' our Open MPI by setting environment variables....
>>>>> Best
>>>>> Paul Kapinos
>>>>> On 12/19/12 11:44, Number Cruncher wrote:
>>>>>> Having run some more benchmarks, the new default is *really* bad for
>>>>>> our application (2-10x slower), so I've been looking at the source to
>>>>>> try and figure out why.
>>>>>> It seems that the biggest difference will occur when the all_to_all
>>>>>> is actually sparse (e.g. our application); if most N-M process
>>>>>> exchanges are zero in size the 1.6
>>>>>> ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
>>>>>> only post irecv/isend for non-zero exchanges; any zero-size exchanges
>>>>>> are skipped. It then waits once for all requests to complete. In
>>>>>> contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise will post
>>>>>> the zero-size exchanges for *every* N-M pair, and wait for each
>>>>>> pairwise exchange. This is O(comm_size) waits, many of which are for
>>>>>> zero-size exchanges. I'm not clear what optimizations there are for
>>>>>> zero-size isend/irecv, but surely there's a great deal more latency
>>>>>> if each pairwise exchange has to be confirmed complete before
>>>>>> executing the next?
>>>>>> Relatedly, how would I direct Open MPI to use the older algorithm
>>>>>> programmatically? I don't want the user to have to use "--mca" in
>>>>>> their "mpiexec". Is there a C API?
>>>>>> Thanks,
>>>>>> Simon
>>>>>> On 16/11/12 10:15, Iliev, Hristo wrote:
>>>>>>> Hi, Simon,
>>>>>>> The pairwise algorithm passes messages in a synchronised ring-like
>>>>>>> fashion with increasing stride, so it works best when independent
>>>>>>> communication paths could be established between several ports of
>>>>>>> the network switch/router. Some 1 Gbps Ethernet equipment is not
>>>>>>> capable of doing so, some is - it depends (usually on the price).
>>>>>>> This said, not all algorithms perform the same given a specific type
>>>>>>> of network interconnect. For example, on our fat-tree InfiniBand
>>>>>>> network the pairwise algorithm performs better.
>>>>>>> You can switch back to the basic linear algorithm by providing the
>>>>>>> following MCA parameters:
>>>>>>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>>>> coll_tuned_alltoallv_algorithm 1 ...
>>>>>>> Algorithm 1 is the basic linear, which used to be the default.
>>>>>>> Algorithm 2 is the pairwise one.
>>>>>>> You can also set these values as exported environment variables:
>>>>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>>>> mpiexec ...
>>>>>>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make
>>>>>>> it have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>>>>>>> coll_tuned_use_dynamic_rules=1
>>>>>>> coll_tuned_alltoallv_algorithm=1
>>>>>>> A gratuitous hint: dual-Opteron systems are NUMA machines, so it
>>>>>>> makes sense to activate process binding with --bind-to-core if you
>>>>>>> haven't already done so. It prevents MPI processes from being
>>>>>>> migrated to other NUMA nodes while running.
>>>>>>> Kind regards,
>>>>>>> Hristo
>>>>>>> --
>>>>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>>> -----Original Message-----
>>>>>>>> From:users-bounces_at_[hidden]
>>>>>>>> [mailto:users-bounces_at_[hidden]]
>>>>>>>> On Behalf Of Number Cruncher
>>>>>>>> Sent: Thursday, November 15, 2012 5:37 PM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>>>>>>>> 1.6.1
>>>>>>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv
>>>>>>>> calls as of version 1.6.1.
>>>>>>>> * This is most noticeable for high-frequency exchanges over 1Gb
>>>>>>>> ethernet where process-to-process message sizes are fairly small
>>>>>>>> (e.g. 100kbyte) and much of the exchange matrix is sparse.
>>>>>>>> * 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
>>>>>>>> algorithm to a pairwise exchange", but I'm not clear what this
>>>>>>>> means or how to switch back to the old "non-default algorithm".
>>>>>>>> I attach a test program which illustrates the sort of usage in our
>>>>>>>> MPI application. I have run this as 32 processes on four nodes,
>>>>>>>> over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core);
>>>>>>>> rank 0,4,8,... on node 1, rank 1,5,9,... on node 2, etc.
>>>>>>>> It constructs an array of integers and a nProcess x nProcess
>>>>>>>> exchange typical of part of our application. This is then exchanged
>>>>>>>> several thousand times.
>>>>>>>> Output from "mpicc -O3" runs shown below.
>>>>>>>> My guess is that 1.6.1 is hitting additional latency not present in
>>>>>>>> 1.6.0. I also attach a plot showing network throughput on our actual
>>>>>>>> mesh generation application. Nodes cfsc01-04 are running 1.6.0 and
>>>>>>>> finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started
>>>>>>>> 10 minutes later) and take over an hour to run. There seems to be a
>>>>>>>> much greater network demand in the 1.6.1 version, despite the
>>>>>>>> user-code and input data being identical.
>>>>>>>> Thanks for any help you can give,
>>>>>>>> Simon
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]