Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
From: Number Cruncher (number.cruncher_at_[hidden])
Date: 2012-12-19 11:31:08


On 19/12/12 11:08, Paul Kapinos wrote:
> Did you *really* want to dig into the code just in order to switch a
> default communication algorithm?

No, I didn't want to, but with such a huge change in performance I'm forced
to do something! And having looked at the different algorithms, I think
there's a problem with the new default whenever message sizes are small
enough that connection latency dominates. We're not all running InfiniBand,
and having to wait for each pairwise exchange to complete before initiating
another seems wrong if the latency of waiting for completion dominates the
transmission time.

E.g. if I have 10 small pairwise exchanges to perform, isn't it better to
put all 10 outbound messages on the wire and wait for the 10 matching
inbound messages, in any order? The new algorithm must wait for the first
exchange to complete, then the second, then the third. Unlike before,
doesn't it also have to wait for the matching *zero-sized* requests to
complete? I don't see why this temporal ordering should matter.
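
To make the distinction concrete, here's a rough sketch of the two patterns
as I read them - illustration only, simplified to MPI_INT data and a single
tag, not the actual ompi_coll_tuned_* source (the function names are mine):

#include <stdlib.h>
#include <mpi.h>

/* Old 1.6.0-style linear pattern: post every non-zero exchange up front,
   then wait once for all of them together. */
static void alltoallv_linear_sketch(int *sbuf, int *scounts, int *sdispls,
                                    int *rbuf, int *rcounts, int *rdispls,
                                    MPI_Comm comm)
{
    int size, n = 0;
    MPI_Comm_size(comm, &size);
    MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));
    for (int peer = 0; peer < size; ++peer) {
        if (rcounts[peer] > 0)          /* zero-size exchanges are skipped */
            MPI_Irecv(rbuf + rdispls[peer], rcounts[peer], MPI_INT,
                      peer, 0, comm, &reqs[n++]);
        if (scounts[peer] > 0)
            MPI_Isend(sbuf + sdispls[peer], scounts[peer], MPI_INT,
                      peer, 0, comm, &reqs[n++]);
    }
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);   /* one wait, any order */
    free(reqs);
}

/* New pairwise pattern: one send/recv pair per step, completed before the
   next step starts, even when both counts are zero. */
static void alltoallv_pairwise_sketch(int *sbuf, int *scounts, int *sdispls,
                                      int *rbuf, int *rcounts, int *rdispls,
                                      MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int step = 0; step < size; ++step) {
        int sendto   = (rank + step) % size;
        int recvfrom = (rank - step + size) % size;
        MPI_Sendrecv(sbuf + sdispls[sendto], scounts[sendto], MPI_INT,
                     sendto, 0,
                     rbuf + rdispls[recvfrom], rcounts[recvfrom], MPI_INT,
                     recvfrom, 0, comm, MPI_STATUS_IGNORE);
    }
}

With a sparse exchange matrix the first version posts only the handful of
non-zero messages and lets them all overlap; the second serialises comm_size
round trips, which is where I suspect the extra latency comes from.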

Thanks for your help,
Simon

>
> Note there are several ways to set the parameters; --mca on command
> line is just one of them (suitable for quick online tests).
>
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>
> We 'tune' our Open MPI by setting environment variables....
>
> Best
> Paul Kapinos
>
>
>
> On 12/19/12 11:44, Number Cruncher wrote:
>> Having run some more benchmarks, the new default is *really* bad for our
>> application (2-10x slower), so I've been looking at the source to try and
>> figure out why.
>>
>> It seems that the biggest difference will occur when the all_to_all is
>> actually sparse (e.g. our application); if most N-M process exchanges are
>> zero in size, the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear
>> algorithm will actually only post irecv/isend for non-zero exchanges; any
>> zero-size exchanges are skipped. It then waits once for all requests to
>> complete. In contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise
>> will post the zero-size exchanges for *every* N-M pair, and wait for each
>> pairwise exchange. This is O(comm_size) waits, many of which are zero.
>> I'm not clear what optimizations there are for zero-size isend/irecv, but
>> surely there's a great deal more latency if each pairwise exchange has to
>> be confirmed complete before executing the next?
>>
>> Relatedly, how would I direct Open MPI to use the older algorithm
>> programmatically? I don't want the user to have to use "--mca" in their
>> "mpiexec". Is there a C API?
>>
>> Thanks,
>> Simon
>>
>>
>> On 16/11/12 10:15, Iliev, Hristo wrote:
>>> Hi, Simon,
>>>
>>> The pairwise algorithm passes messages in a synchronised ring-like
>>> fashion with increasing stride, so it works best when independent
>>> communication paths could be established between several ports of the
>>> network switch/router. Some 1 Gbps Ethernet equipment is not capable of
>>> doing so, some is - it depends (usually on the price). This said, not
>>> all algorithms perform the same given a specific type of network
>>> interconnect. For example, on our fat-tree InfiniBand network the
>>> pairwise algorithm performs better.
>>>
>>> You can switch back to the basic linear algorithm by providing the
>>> following MCA parameters:
>>>
>>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...
>>>
>>> Algorithm 1 is the basic linear, which used to be the default.
>>> Algorithm 2 is the pairwise one.
>>> You can also set these values as exported environment variables:
>>>
>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>> mpiexec ...
>>>
>>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make it
>>> have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>>>
>>> coll_tuned_use_dynamic_rules=1
>>> coll_tuned_alltoallv_algorithm=1
>>>
>>> A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes
>>> sense to activate process binding with --bind-to-core if you haven't
>>> already done so.
>>> It prevents MPI processes from being migrated to other NUMA nodes while
>>> running.
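
(For reference, with the 1.6-series mpiexec option names that would look
something like the line below; "my_app" is just a placeholder:

mpiexec --bind-to-core --report-bindings -np 32 ./my_app

--report-bindings prints where each rank was bound, which is handy for
checking that the binding actually took effect.)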
>>>
>>> Kind regards,
>>> Hristo
>>> --
>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>> RWTH Aachen University, Center for Computing and Communication
>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>
>>>
>>>> -----Original Message-----
>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>> On Behalf Of Number Cruncher
>>>> Sent: Thursday, November 15, 2012 5:37 PM
>>>> To: Open MPI Users
>>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to
>>>> 1.6.1
>>>>
>>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls
>>>> as of version 1.6.1.
>>>> * This is most noticeable for high-frequency exchanges over 1Gb
>>>> Ethernet, where process-to-process message sizes are fairly small
>>>> (e.g. 100 kbyte) and much of the exchange matrix is sparse.
>>>> * 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
>>>> algorithm to a pairwise exchange", but I'm not clear what this means
>>>> or how to switch back to the old "non-default algorithm".
>>>>
>>>> I attach a test program which illustrates the sort of usage in our MPI
>>>> application. I have run this as 32 processes on four nodes, over 1Gb
>>>> Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks
>>>> 0, 4, 8, ... on node 1, ranks 1, 5, 9, ... on node 2, etc.
>>>>
>>>> It constructs an array of integers and a nProcess x nProcess exchange
>>>> matrix typical of part of our application. This is then exchanged
>>>> several thousand times.
>>>> Output from "mpicc -O3" runs shown below.
>>>>
>>>> My guess is that 1.6.1 is hitting additional latency not present in
>>>> 1.6.0. I also attach a plot showing network throughput on our actual
>>>> mesh generation application. Nodes cfsc01-04 are running 1.6.0 and
>>>> finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started
>>>> 10 minutes later) and take over an hour to run. There seems to be a
>>>> much greater network demand in the 1.6.1 version, despite the
>>>> user-code and input data being identical.
>>>>
>>>> Thanks for any help you can give,
>>>> Simon
>>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>