
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] All_to_allv algorithm patch
From: Iliev, Hristo (iliev_at_[hidden])
Date: 2013-02-05 04:46:44


Hi,

This is the users mailing list. There is a separate one for questions
related to Open MPI development - devel_at_open-mpi.org.

Besides, why don't you open a ticket in the Open MPI Trac at
https://svn.open-mpi.org/trac/ompi/ and post your patches against trunk
there? My experience shows that even simple changes to the collective
framework are of low importance to the OMPI development team, and the
chances of such changes entering the 1.6.x branch are practically zero. By
the way, the off-by-one issue was reported almost a year ago and is already
fixed in trunk.

I believe the rationale behind the algorithm switch was given by either
Jeff or George some time ago: the linear code does not scale to a large
number of processes.

Kind regards,
Hristo

> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Number Cruncher
> Sent: Monday, February 04, 2013 5:19 PM
> To: Open MPI Users
> Subject: [OMPI users] All_to_allv algorithm patch
>
> I'll try running this by the mailing list again, before resigning myself
> to maintaining this privately....
>
> I've looked in detail at the current two MPI_Alltoallv algorithms and
> wanted to raise a couple of ideas.
>
> Firstly, the new default "pairwise" algorithm.
> * There is no optimisation for sparse/empty messages, compared to the old
> basic "linear" algorithm.
> * The attached "pairwise-nop" patch adds this optimisation and, on the
> test case I first described in this thread (1000's of small, sparse,
> all-to-all), this cuts runtime by approximately 30%.
> * I think the upper bound on the loop counter for pairwise exchange is
> off-by-one. As the comment notes, "starting from 1 since local exhange
> [sic] is done"; but when step = (size + 1), the sendto/recvfrom both
> reduce to rank (self-exchange is already handled in earlier code).
>
> The pairwise algorithm still kills performance on my gigabit ethernet
> network. My message transmission time must be small compared to latency,
> and the forced MPI_Comm_size() synchronisation steps introduce a minimum
> delay (single_link_latency * comm_size), i.e. latency scales linearly
> with comm_size. The linear algorithm doesn't wait for each exchange, so
> its minimum latency is just a single transmit/receive.
>
> Which brings me to the second idea. The problem with the existing
> implementation of the linear algorithm is that the irecv/isend pattern is
> identical on all processes, meaning that every process starts by having
> to wait for process 0 to send to everyone, and every process finishes by
> waiting for rank (size-1) to send to everyone.
>
> It seems preferable to at least post the send/recv requests in the same
> order as the pairwise algorithm. The attached "linear-alltoallv" patch
> implements this and, on my test case, shows a modest 5% improvement.
> I was wondering if it would address the concerns which led to the switch
> of default algorithm.
>
> Simon

--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D-52074 Aachen (Germany)

