Good evening Eugene,

 First, thanks for trying to help me.

 I already gave a profiling tool a try, namely IPM, which is rather
simple to use. Below is some output from a 1024-core run.
Unfortunately, I am not yet able to produce the equivalent chart for MPT.

#IPMv0.983####################################################################
#
# command : unknown (completed)
# host    : r34i0n0/x86_64_Linux           mpi_tasks : 1024 on 128 nodes
# start   : 12/21/10/13:18:09              wallclock : 3357.308618 sec
# stop    : 12/21/10/14:14:06              %comm     : 27.67
# gbytes  : 0.00000e+00 total              gflop/sec : 0.00000e+00 total
#
##############################################################################
# region  : *       [ntasks] =   1024
#
#                           [total]         <avg>           min           max
# entries                       1024             1             1             1
# wallclock              3.43754e+06       3356.98       3356.83       3357.31
# user                   2.82831e+06       2762.02       2622.04       2923.37
# system                      376230       367.412       174.603       492.919
# mpi                         951328       929.031       633.137       1052.86
# %comm                                    27.6719       18.8601        31.363
# gflop/sec                        0             0             0             0
# gbytes                           0             0             0             0
#
#
#                            [time]       [calls]        <%mpi>      <%wall>
# MPI_Waitall                 741683   7.91081e+07         77.96        21.58
# MPI_Allreduce               114057   2.53665e+07         11.99         3.32
# MPI_Recv                   40164.7          2048          4.22         1.17
# MPI_Isend                  27420.6   6.53513e+08          2.88         0.80
# MPI_Barrier                25113.5          2048          2.64         0.73
# MPI_Sendrecv                2123.6        212992          0.22         0.06
# MPI_Irecv                  464.616   6.53513e+08          0.05         0.01
# MPI_Reduce                 215.447        171008          0.02         0.01
# MPI_Bcast                  85.0198          1024          0.01         0.00
# MPI_Send                  0.377043          2048          0.00         0.00
# MPI_Comm_rank          0.000744925          4096          0.00         0.00
# MPI_Comm_size          0.000252183          1024          0.00         0.00
###############################################################################
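
 If I read the table correctly, the MPI_Waitall line works out roughly as
follows (per-rank average, computed from the totals above):

    741683 s / 1024 ranks          ~  724 s per rank
    724 s / 3357 s wallclock       ~  21.6 %   (the <%wall> column)
    741683 s / 951328 s total MPI  ~  78 %     (the <%mpi> column)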


 To my non-expert eye, MPI_Waitall is clearly dominant among the MPI calls,
but not for the application as a whole; however, I will have to compare
with the MPT run before drawing any conclusion.
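
 Regarding the stand-alone reproducer you suggest, I was thinking of
something along the following lines: a simple ring of non-blocking
exchanges completed with MPI_Waitall. This is only a minimal sketch; the
message size, neighbour pattern and iteration count are placeholders, not
the values used in the real application:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, it;
    const int nmsg  = 32768;   /* doubles per message: placeholder */
    const int niter = 1000;    /* number of exchanges: placeholder */
    double *sbuf, *rbuf, t0, t1;
    MPI_Request req[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* simple 1-D ring of neighbours */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    sbuf = calloc(2 * nmsg, sizeof(double));
    rbuf = calloc(2 * nmsg, sizeof(double));

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    for (it = 0; it < niter; it++) {
        /* post receives first, then sends, then wait for all four */
        MPI_Irecv(rbuf,        nmsg, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(rbuf + nmsg, nmsg, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(sbuf,        nmsg, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(sbuf + nmsg, nmsg, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d exchanges of %d doubles: %f s\n", niter, nmsg, t1 - t0);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

 If such a kernel shows the same behaviour as the full code, it should make
the comparison with MPT much easier.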

 Thanks again for your suggestions, which I'll address one by one.

 Best,     G.

On 22/12/2010 18:50, Eugene Loh wrote:
Can you isolate a bit more where the time is being spent?  The performance effect you're describing appears to be drastic.  Have you profiled the code?  Some choices of tools can be found in the FAQ http://www.open-mpi.org/faq/?category=perftools  The results may be "uninteresting" (all time spent in your MPI_Waitall calls, for example), but it'd be good to rule out other possibilities (e.g., I've seen cases where it's the non-MPI time that's the culprit).

If all the time is spent in MPI_Waitall, then I wonder if it would be possible for you to reproduce the problem with just some MPI_Isend|Irecv|Waitall calls that mimic your program.  E.g., "lots of short messages", or "lots of long messages", etc.  It sounds like there is some repeated set of MPI exchanges, so maybe that set can be extracted and run without the complexities of the application.

Anyhow, some profiling might help guide one to the problem.

Gilbert Grosdidier wrote:

There is indeed a high rate of communication. But the buffer
size is always the same for a given pair of processes, and I thought
that mpi_leave_pinned should avoid freeing the memory in this case.
Am I wrong?