
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2010-12-22 14:04:43


Gilbert Grosdidier wrote:
Good evening, Eugene,
Good morning where I am.
Here follows some output for a 1024 core run.
Assuming this corresponds meaningfully with your original e-mail, 1024 cores means performance of 700 vs 900.  So, that looks roughly consistent with the 28% MPI time you show here.  That seems to imply that the slowdown is due entirely to long MPI times (rather than slow non-MPI times).  Just a sanity check.
Unfortunately, I'm still unable to produce the equivalent MPT chart.
That may be all right.  If one run clearly shows a problem (which is perhaps the case here), then a "good profile" is not needed.  Here, a "good profile" would perhaps be used only to confirm that near-zero MPI time is possible.
#IPMv0.983####################################################################
# host    : r34i0n0/x86_64_Linux           mpi_tasks : 1024 on 128 nodes
# start   : 12/21/10/13:18:09              wallclock : 3357.308618 sec
# stop    : 12/21/10/14:14:06              %comm     : 27.67
##############################################################################
#
#                           [total]         <avg>           min           max
# wallclock              3.43754e+06       3356.98       3356.83       3357.31
# user                   2.82831e+06       2762.02       2622.04       2923.37
# system                      376230       367.412       174.603       492.919
# mpi                         951328       929.031       633.137       1052.86
# %comm                                    27.6719       18.8601        31.363
No glaring evidence here of load imbalance being the sole explanation, but hard to tell from these numbers.  (If min comm time is 0%, then that process is presumably holding everyone else up.)
#                             [time]       [calls]        <%mpi>      <%wall>
# MPI_Waitall                 741683   7.91081e+07         77.96        21.58
# MPI_Allreduce               114057   2.53665e+07         11.99         3.32
# MPI_Isend                  27420.6   6.53513e+08          2.88         0.80
# MPI_Irecv                  464.616   6.53513e+08          0.05         0.01
###############################################################################


It seems to my non-expert eye that MPI_Waitall is dominant among MPI calls,
but not for the overall application.
If at 1024 cores performance is 700 compared to 900, then whatever the problem is, it still hasn't dominated the entire application's performance.  So, it looks like MPI_Waitall is the problem, even if it doesn't dominate overall application time.

Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend calls and 8+ MPI_Irecv calls.  I think IPM gives some point-to-point messaging information.  Maybe you can tell what the distribution of message sizes is, etc.  Or, maybe you already know the characteristic pattern.  Does a stand-alone message-passing test (without the computational portion) capture the performance problem you're looking for?
On 22/12/2010 18:50, Eugene Loh wrote:
Can you isolate a bit more where the time is being spent?  The performance effect you're describing appears to be drastic.  Have you profiled the code?  Some choices of tools can be found in the FAQ http://www.open-mpi.org/faq/?category=perftools  The results may be "uninteresting" (all time spent in your MPI_Waitall calls, for example), but it'd be good to rule out other possibilities (e.g., I've seen cases where it's the non-MPI time that's the culprit).

If all the time is spent in MPI_Waitall, then I wonder if it would be possible for you to reproduce the problem with just some MPI_Isend|Irecv|Waitall calls that mimic your program.  E.g., "lots of short messages", or "lots of long messages", etc.  It sounds like there is some repeated set of MPI exchanges, so maybe that set can be extracted and run without the complexities of the application.
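A minimal skeleton of the kind of stand-alone reproducer being suggested might look like the following (a sketch only: the neighbor count, message size, and ring-style exchange pattern here are placeholders to be tuned to mimic the real application, not its actual communication pattern):

```c
/* Build with: mpicc waitall_mimic.c -o waitall_mimic
   Run with:   mpirun -np <N> ./waitall_mimic */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NEIGHBORS   8      /* ~8 isends/irecvs per waitall, as in the profile */
#define MSG_DOUBLES 1024   /* placeholder message size -- tune to the app */
#define ITERATIONS  1000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = calloc(NEIGHBORS * MSG_DOUBLES, sizeof(double));
    double *recvbuf = calloc(NEIGHBORS * MSG_DOUBLES, sizeof(double));
    MPI_Request reqs[2 * NEIGHBORS];

    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERATIONS; it++) {
        int n = 0;
        for (int k = 1; k <= NEIGHBORS; k++) {
            /* Exchange with ranks +/- k around a ring. */
            int peer_up   = (rank + k) % size;
            int peer_down = (rank - k + size) % size;
            MPI_Irecv(recvbuf + (k - 1) * MSG_DOUBLES, MSG_DOUBLES, MPI_DOUBLE,
                      peer_down, 0, MPI_COMM_WORLD, &reqs[n++]);
            MPI_Isend(sendbuf + (k - 1) * MSG_DOUBLES, MSG_DOUBLES, MPI_DOUBLE,
                      peer_up, 0, MPI_COMM_WORLD, &reqs[n++]);
        }
        /* One waitall completing ~8 sends and ~8 receives, as observed. */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg time per iteration: %g sec\n", (t1 - t0) / ITERATIONS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Running this at both the small and the 1024-core scale, and varying MSG_DOUBLES and NEIGHBORS, would show whether the pure message-passing kernel reproduces the slowdown without the rest of the application.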