
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
From: Number Cruncher (number.cruncher_at_[hidden])
Date: 2012-11-15 11:37:25


I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls
as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1 Gb Ethernet
where process-to-process message sizes are fairly small (e.g. 100 kbyte)
and much of the exchange matrix is empty.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
algorithm to a pairwise exchange", but I'm not clear what this means or
how to switch back to the old "non-default algorithm" (my best guess at
the relevant MCA parameters is just below).
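
From skimming "ompi_info --param coll tuned" my guess - and it is only a
guess, so please correct me if these aren't the right knobs - is that the
pre-1.6.1 behaviour can be requested through the tuned module's dynamic
rules, something like:

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoallv_algorithm 1 ...

where, if I've read the parameter descriptions correctly, 1 selects the
old basic linear algorithm and 2 the new pairwise exchange.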

I attach a test program which illustrates the sort of usage in our MPI
application. I have run it as 32 processes on four nodes over 1 Gb
Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8, ...
on node 1, ranks 1, 5, 9, ... on node 2, etc.

It constructs an array of integers and an nProcess x nProcess exchange
matrix typical of part of our application, then performs the exchange
several thousand times. Output from runs built with "mpicc -O3" is shown
below.
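
For anyone who doesn't want to open the attachment, the core of the test
boils down to the pattern below. This is a stripped-down sketch rather
than the attached file itself - the counts are only a rough stand-in for
our real exchange matrix and the repeat count is arbitrary:

  /* alltoallv_sketch.c - illustrative only, not the attached test program.
   * Build: mpicc -O3 -std=c99 alltoallv_sketch.c -o alltoallv_sketch
   * Run:   mpirun -np 32 ./alltoallv_sketch
   */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define BLOCK   100    /* ints per unit of the exchange matrix        */
  #define REPEATS 5000   /* exchange repeated several thousand times    */

  int main(int argc, char **argv)
  {
      int rank, nproc;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nproc);

      int *sendcounts = malloc(nproc * sizeof(int));
      int *recvcounts = malloc(nproc * sizeof(int));
      int *sdispls    = malloc(nproc * sizeof(int));
      int *rdispls    = malloc(nproc * sizeof(int));

      /* Sparse, distance-decaying pattern: most pairs exchange nothing,
       * near neighbours exchange a handful of 100-int blocks.  The
       * pattern is symmetric, so recvcounts mirror sendcounts.         */
      int stotal = 0, rtotal = 0;
      for (int p = 0; p < nproc; ++p) {
          int d     = abs(rank - p);
          int units = (d == 0 || d > 5) ? 0 : (50 >> d);  /* 25,12,6,3,1 */
          sendcounts[p] = units * BLOCK;
          recvcounts[p] = units * BLOCK;
          sdispls[p] = stotal;  stotal += sendcounts[p];
          rdispls[p] = rtotal;  rtotal += recvcounts[p];
      }

      int *sendbuf = malloc((stotal ? stotal : 1) * sizeof(int));
      int *recvbuf = malloc((rtotal ? rtotal : 1) * sizeof(int));
      for (int i = 0; i < stotal; ++i) sendbuf[i] = rank;

      double t0 = MPI_Wtime();
      for (int it = 0; it < REPEATS; ++it)
          MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                        recvbuf, recvcounts, rdispls, MPI_INT,
                        MPI_COMM_WORLD);
      double t1 = MPI_Wtime();

      if (rank == 0)
          printf("Total time = %f seconds\n", t1 - t0);

      free(sendbuf); free(recvbuf);
      free(sendcounts); free(recvcounts);
      free(sdispls); free(rdispls);
      MPI_Finalize();
      return 0;
  }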

My guess is that 1.6.1 is hitting additional latency not present in
1.6.0. I also attach a plot showing network throughput on our actual
mesh generation application. Nodes cfsc01-04 are running 1.6.0 and
finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10
minutes later) and take over an hour to run. There seems to be a much
greater network demand with 1.6.1/1.6.2, despite the user code and
input data being identical.

Thanks for any help you can give,
Simon

For 1.6.0:

Open MPI 1.6.0
Proc 0: 50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 198 x 100 int
Proc 1: 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 148 x 100 int
Proc 2: 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 109 x 100 int
Proc 3: 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 80 x 100 int
Proc 4: 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 58 x 100 int
Proc 5: 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 41 x 100 int
Proc 6: 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 29 x 100 int
Proc 7: 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 20 x 100 int
Proc 8: 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 14 x 100 int
Proc 9: 3 2 1 0 0 0 0 0 0 0 0 0 Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0 Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0 Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1 Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2 Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3 Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4 Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 Total: 109 x 100 int
Proc 30: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 Total: 148 x 100 int
Proc 31: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 50 Total: 198 x 100 int
....................................................................................................
Total time = 15.443502 seconds

For 1.6.1:

Open MPI 1.6.1
Proc 0: 50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 198 x 100 int
Proc 1: 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 148 x 100 int
Proc 2: 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 109 x 100 int
Proc 3: 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 80 x 100 int
Proc 4: 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 58 x 100 int
Proc 5: 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 41 x 100 int
Proc 6: 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 29 x 100 int
Proc 7: 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 20 x 100 int
Proc 8: 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 14 x 100 int
Proc 9: 3 2 1 0 0 0 0 0 0 0 0 0 Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0 Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0 Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1 Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2 Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3 Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4 Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 Total: 109 x 100 int
Proc 30: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 Total: 148 x 100 int
Proc 31: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 50 Total: 198 x 100 int
....................................................................................................
Total time = 25.549821 seconds




Attachment: 160_vs_162.png