What I am reporting definitely is an openmpi scaling problem. The 32
waters problem I am talking about does scale to 64 cores, as clearly
shown by the numbers I posted, if I use IntelMPI (or mvapich) instead
of openmpi, on the same hardware, same code, same compiler, same Intel
mkl libraries, same input. It is also known to scale well to 64
cores with openmpi on a different parallel machine, as confirmed by a
friend of mine. I know CPMD well and have been using it for a long
time. The question is why, on my new cluster, a run on 64 cores
with openmpi takes longer than on 32 cores, while with IntelMPI the
64-core run is 2 times faster than on 32, as it should be. Could there be
some misconfiguration of openmpi?
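
For reference, the speedup and parallel efficiency implied by the timings in the original post (quoted at the bottom of this message) can be worked out with a few lines of Python. This is only a sketch: the helper names (to_seconds, speedup, efficiency) are made up for illustration, and the 8-core run is taken as the baseline because no single-core time was posted.

```python
# Elapsed times copied from the original post, converted to seconds.
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

openmpi = {
    8: to_seconds(7, 26.40),
    16: to_seconds(4, 19.91),
    32: to_seconds(2, 55.51),
    48: to_seconds(2, 38.18),
    64: to_seconds(3, 19.78),
}
intelmpi = {
    8: to_seconds(7, 23.19),
    16: to_seconds(4, 22.17),
    32: to_seconds(2, 50.07),
    48: to_seconds(1, 42.87),
    64: to_seconds(1, 23.76),
}

def speedup(times, base=8):
    """Speedup of each run relative to the run on `base` cores."""
    return {n: times[base] / t for n, t in times.items()}

def efficiency(times, base=8):
    """Speedup divided by the ideal speedup n/base (1.0 = perfect scaling)."""
    return {n: (times[base] / t) / (n / base) for n, t in times.items()}

for name, times in (("openmpi", openmpi), ("Intel MPI", intelmpi)):
    s, e = speedup(times), efficiency(times)
    for n in sorted(times):
        print(f"{name:9s} np{n:<3d} {times[n]:7.2f} s  "
              f"speedup {s[n]:4.2f}  efficiency {e[n]:4.2f}")
```

The point of the comparison is the last step: going from 32 to 64 cores roughly halves the elapsed time with Intel MPI, while with openmpi the 64-core run is slower than the 32-core one.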
On Sat, May 16, 2009 at 2:05 AM, Gus Correa <gus_at_[hidden]> wrote:
> Hi Roman
> I googled out and found that CPMD is a molecular dynamics program.
> (What would be of civilization without Google?)
> Unfortunately I kind of wiped off from my mind
> Schrodinger's equation, Quantum Mechanics,
> and the Born approximation,
> which I learned probably before you were born.
> I couldn't find any short description of the CPMD algorithm,
> or a good diagram with the mesh, problem size, whatever that might
> clarify how the algorithm works.
> Hence, it is still hard for me to tell what type of scaling of CPMD
> to expect.
> There must be something in the manual, if you read it.
> However, they do mention mesh size, problem size, etc,
> several times, and if you dig out, you'll find the parameters
> that control scaling.
> The topmost parameter that should control scaling is probably the
> problem size or mesh size, but there may be other things,
> such as their controlling variables "taskgroups",
> "blocksize states", etc.
> Not knowing the algorithm I can only guess.
> Anyway, the CPMD manual has
> some recommendations on how to divide the task
> across processors in a meaningful (and efficient) way,
> which you should read:
> CPMD on Parallel Computers
> Parallelizing CPMD with MPI:
> For some reason, all the "CPMD 32 water" benchmarks on their
> web page stop at 32 processors (except for one with 128 processors ran on a
> which is a different beast than a beowulf cluster).
> I suggest that you read the "Parallel Performance" section and figures,
> as some sentences clearly indicate that some problem sizes are not
> large enough to require more than a few tens of processors:
> Is this perhaps because the size of the problem doesn't justify using
> more than 32 processors?
> What is the meaning of the "32" on "CPMD 32 water"?
> I hope this helps,
> Gus Correa
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> Gus Correa wrote:
>> Hi Roman
>> Just a guess.
>> Is this a domain decomposition code?
>> (I never heard about "cpmd 32 waters" before, sorry.)
>> Is it based on finite differences, finite volume, finite element?
>> If it is, once the size of the subdomains becomes too small compared to
>> the size of the halo around them, the overhead required to calculate
>> your solution for the halo swamps the whole calculation,
>> and scaling degrades.
>> This is not an MPI scaling problem, this is intrinsic to the domain
>> decomposition technique.
>> Typically this happens as the number of processors reaches some high number
>> (which depends on the size of the problem).
>> So, what you are seeing may not be a problem with OpenMPI scaling,
>> but just that your problem is not large enough to require the use of, say,
>> 48 or 64 processors.
>> For instance, imagine a 1D problem with a grid with 1024 points,
>> that requires a 2 grid point overlap (halo) on the left and right
>> of any subdomain to be calculated in parallel (i.e. decomposing the domain
>> in parts).
>> If you divide the domain across two processors only, each processor
>> has to work not on 1024/2=512 points, but on 512+2+2=516 points.
>> The calculation on the two processors gets an overhead of 2*(2+2)=8 grid
>> points, w.r.t. the same calculation done on a single processor.
>> This is an overhead of 8/1024=0.8% only, so using 2 processors
>> will speed up the calculation by a factor close to 2 (but slightly lower).
>> However, if you divide the same problem across 64 subdomains (i.e. 64
>> processors), the size of each subdomain is 1024/64=16,
>> plus 2 halo grid points on each side, i.e. 20 grid points.
>> So the overhead is much higher now, 4/16=25%.
>> Dividing the problem across 64 processors will not speed it up by
>> a factor of 64, but by much less.
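
The arithmetic in the 1D example above can be reproduced with a short script. It is a rough sketch only: halo_overhead and ideal_speedup are hypothetical helper names, every subdomain is charged a halo on both sides as in the example, and the speedup estimate assumes that work is simply proportional to the number of grid points handled (halo included) and that communication costs nothing.

```python
def halo_overhead(n_points, n_procs, halo=2):
    """Return (interior points, points including halo, overhead fraction)
    for one subdomain of a 1D grid split evenly across n_procs."""
    interior = n_points // n_procs
    if n_procs == 1:
        # A single processor needs no halo at all.
        return interior, interior, 0.0
    with_halo = interior + 2 * halo  # halo on the left and on the right
    return interior, with_halo, (2 * halo) / interior

def ideal_speedup(n_points, n_procs, halo=2):
    """Upper bound on speedup under the assumptions stated above:
    serial work n_points versus per-processor work interior + halo."""
    _, with_halo, _ = halo_overhead(n_points, n_procs, halo)
    return n_points / with_halo

for p in (1, 2, 16, 32, 64):
    interior, with_halo, frac = halo_overhead(1024, p)
    print(f"{p:3d} procs: {interior:4d} + halo = {with_halo:4d} points, "
          f"overhead {100 * frac:5.1f}%, speedup <= {ideal_speedup(1024, p):5.1f}")
```

With 1024 points this gives 516 points per processor on 2 processors and 20 points per processor on 64, matching the figures in the example.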
>> Every domain decomposition program that we have here shows this
>> effect. If we give them more processors they scale well, up
>> to a point (say 16 or 32 processors, for a reasonably sized problem).
>> However, beyond that point the scaling slowly flattens out.
>> When you go and look at the grid size and the
>> large number of processors,
>> you realize that most of the effort is being done to calculate halos,
>> i.e. on overhead.
>> On top of that, there is the overhead due to MPI communication, of
>> course, but it is likely that the halo overhead is the dominant factor.
>> I would guess other classes of problems and parallel methods of solution
>> also have the same problem that domain decomposition shows.
>> Is this perhaps what is going on with your test code?
>> Take a look at the code to see what it is doing,
>> and in particular, what the problem size is.
>> See if it really makes sense to distribute it over 64 processors,
>> or if a smaller number would be the right choice.
>> Also, if the program allows you to change the problem size,
>> try the test again with a larger problem size
>> (say, two or four times bigger),
>> and then go up to a large number of processors also.
>> With a larger problem size the scaling may be better too
>> (but the runtimes will grow as well).
>> Finally, since you are using Infiniband, I wonder if all the
>> nodes connect to each other with the same latency, or if some
>> pairs of nodes have higher latency to communicate.
>> On a single switch hopefully the latency is the same for all pairs of nodes.
>> However, if you connect two switches, for instance, nodes that
>> are on switch A will probably have a larger latency to talk
>> to nodes on switch B, I suppose.
>> I hope it helps.
>> Gus Correa
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> Roman Martonak wrote:
>>> I observe very poor scaling with openmpi on an HP blade system consisting
>>> of 8 blades (each having 2 quad-core AMD Barcelona 2.2 GHz CPUs) and
>>> interconnected with Infiniband fabric. When running the standard cpmd
>>> 32 waters test, I observe the following scaling (the numbers are
>>> elapsed time)
>>> using full blades (8 cores per blade)
>>> np8 7 MINUTES 26.40 SECONDS
>>> np16 4 MINUTES 19.91 SECONDS
>>> np32 2 MINUTES 55.51 SECONDS
>>> np48 2 MINUTES 38.18 SECONDS
>>> np64 3 MINUTES 19.78 SECONDS
>>> I tried also openmpi-1.2.8 and openmpi-1.3 and it is about the same,
>>> openmpi-1.3 is somewhat better for 32 cores but in all cases there is
>>> practically no scaling beyond 4 blades (32 cores) and running on 64
>>> cores is a disaster. With Intel MPI, however, I get the following
>>> Intel MPI-3.2.1.009
>>> using full blades (8 cores per blade)
>>> np8 7 MINUTES 23.19 SECONDS
>>> np16 4 MINUTES 22.17 SECONDS
>>> np32 2 MINUTES 50.07 SECONDS
>>> np48 1 MINUTES 42.87 SECONDS
>>> np64 1 MINUTES 23.76 SECONDS
>>> so there is reasonably good scaling up to 64 cores. I am running with
>>> the option
>>> --mca mpi_paffinity_alone 1, I have tried also -mca btl_openib_use_srq
>>> 1 but it had only marginal effect. With mvapich I get similar scaling
>>> as with Intel MPI. The system is running the Rocksclusters
>>> distribution 5.1 with the mellanox ofed-1.4 roll. I would be grateful
>>> if somebody could suggest what the origin of the problem could be
>>> and how to tune openmpi to get better scaling.
>>> Many thanks in advance.
>>> Best regards
>>> users mailing list