Open MPI User's Mailing List Archives

Subject: [OMPI users] How to improve non-blocking point-to-point communication scaling
From: Gus Correa (gus_at_[hidden])
Date: 2009-07-10 15:34:09


Dear OpenMPI experts

We are seeing poor scaling in a code that uses OpenMPI
non-blocking point-to-point routines,
and would love to hear any suggestions on how to improve the situation.

Details:

We have a small 24-node cluster (Monk) with Infiniband, dual AMD Opteron
quad-core processors, and we are using OpenMPI 1.3.2.

One of the codes we run here is the MITgcm.
The code is written in Fortran 77 and uses a standard domain
decomposition technique, parallelized with MPI (OpenMPI in our case).

Some of the heavy lifting is done by a routine that solves the so-called
barotropic pressure equation (an elliptic PDE) using a conjugate
gradient technique, which typically takes 300 iterations at each
time step.
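
For readers less familiar with the method, here is a minimal serial
sketch of a textbook conjugate gradient loop. It is not the MITgcm
solver: the 1-D Poisson operator, the function names, and the counters
are assumptions made purely for illustration. It shows that each
iteration of the textbook form needs one operator application and two
dot products. In a domain-decomposed run, the operator application
needs boundary values from neighboring subdomains and each dot product
needs a sum over all subdomains, so at about 300 iterations per time
step the communication cost can add up quickly.

```python
def apply_laplacian(x):
    """Illustrative 1-D Poisson operator with zero boundary values:
    (A x)_i = 2*x_i - x_(i-1) - x_(i+1)."""
    n = len(x)
    return [
        2.0 * x[i]
        - (x[i - 1] if i > 0 else 0.0)
        - (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def dot(a, b):
    """Inner product; in a decomposed run this would need a global sum."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(b, tol=1e-10, max_iter=300):
    """Solve A x = b with plain (unpreconditioned) conjugate gradient.

    Returns the solution, the iteration count, and a tally of operator
    applications and dot products.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rr = dot(r, r)
    counts = {"matvec": 0, "dot": 1}
    iterations = 0
    while iterations < max_iter and rr ** 0.5 > tol:
        ap = apply_laplacian(p)        # would need a boundary exchange
        counts["matvec"] += 1
        alpha = rr / dot(p, ap)        # would need a global sum
        counts["dot"] += 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)             # would need a global sum
        counts["dot"] += 1
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iterations += 1
    return x, iterations, counts


if __name__ == "__main__":
    solution, iterations, counts = conjugate_gradient([1.0] * 20)
    print("iterations:", iterations)
    print("operator applications:", counts["matvec"])
    print("dot products:", counts["dot"])
```

The tally is the point of the sketch: the number of operator
applications equals the iteration count, and the number of dot products
is twice the iteration count plus one for the initial residual.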

The pressure solver conjugate gradient routine uses
MPI point-to-point non-blocking communication
to exchange arrays across the subdomain boundaries.
The only MPI calls are MPI_Isend, MPI_Recv, and MPI_Waitall.
(There are also a few MPI_Barrier calls, but they appear to be inactive,
disabled by preprocessor directives.)

Problem:

One user noted that when he increases the number of processors,
the pressure solver takes a progressively larger share of the total
walltime, and this percentage is much larger than on other (public)
clusters.

Here is a typical result on our cluster (Monk):

Nodes   Cores   Percent of time taken by pressure solver
    1       8    5%  (Note: IB not used, single-node run)
    2      16   14%
    4      32   45%
   12      96   80%
(Note: the pressure solver's share of the time grows quickly with the
number of cores used.)

However, according to the same user, when he runs the same code on the
TACC Ranger and Lonestar clusters,
the pressure solver takes a significantly smaller fraction
of the total runtime, even when
the number of cores used is large.

Here are his results at TACC:

On Lonestar (dual Xeon dual core, Infiniband (?), MVAPICH2 (?))
Nodes   Cores   Percent of time taken by pressure solver
   16      64   22%

On Ranger (dual Opteron quad core, Infiniband, MVAPICH2)
Nodes   Cores   Percent of time taken by pressure solver
    8      64   19%
   24     192   35%

(Note: a much smaller % than on our machine at comparable or larger
core counts.)
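
One way to put these percentages on a common footing is to convert each
share into the ratio of solver time to all other time, f / (1 - f). The
short sketch below does only that arithmetic on the figures quoted
above. The dictionary layout and the function name are assumptions for
illustration, and no new measurements are introduced.

```python
# Pressure solver share of wall time (percent), keyed by core count,
# copied from the figures quoted in this message.
SOLVER_SHARE = {
    "Monk": {8: 5, 16: 14, 32: 45, 96: 80},
    "Lonestar": {64: 22},
    "Ranger": {64: 19, 192: 35},
}


def solver_to_rest_ratio(percent):
    """Convert a solver share in percent into solver time / other time."""
    fraction = percent / 100.0
    return fraction / (1.0 - fraction)


if __name__ == "__main__":
    for cluster, runs in SOLVER_SHARE.items():
        for cores, percent in sorted(runs.items()):
            ratio = solver_to_rest_ratio(percent)
            print(f"{cluster:9s} {cores:4d} cores  {percent:3d}%  "
                  f"solver/rest = {ratio:.2f}")
```

For example, an 80% share means the solver takes four times as long as
everything else combined, whereas a 19% share means it takes a bit
under a quarter as long as everything else.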

I wonder if there is any parameter that I can tweak in OpenMPI that
may reduce the percent time taken by the pressure solver.

Any suggestions are appreciated.

Many thanks,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------