Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
From: Gus Correa (gus_at_[hidden])
Date: 2012-07-10 18:06:16

On 07/10/2012 03:54 PM, David Warren wrote:
> Your problem may not be related to bandwidth. It may be latency or
> division of the problem. We found significant improvements running wrf
> and other atmospheric code (CFD) over IB. The problem was not so much
> the amount of data communicated, but how long it takes to send it. Also,
> is your model big enough to split up as much as you are trying? If there
> is not enough work for each core to do between edge exchanges, you will
> spend all of your time spinning waiting for the network. If you are
> running a demo benchmark it is likely too small for the number of
> processors. At least that is what we find with most weather model demo
> problems. One other thing to look at is how it is being split up.
> Depending on what the algorithm does, you may want more x points, more y
> points or completely even divisions. We found that we can significantly
> speed up wrf for our particular domain by not lett
> On 07/10/12 08:48, Dugenoux Albert wrote:
>> Thanks for your answer.You are right.
>> I've tried upon 4 nodes with 6 processes and things are worst.
>> So do you suggest that unique thing to do is to order an infiniband
>> switch or is there a possibility to enhance
>> something by tuning mca parameters ?
>> *De :* Ralph Castain <rhc_at_[hidden]>
>> *À :* Dugenoux Albert <dugenouxa_at_[hidden]>; Open MPI Users
>> <users_at_[hidden]>
>> *Envoyé le :* Mardi 10 juillet 2012 16h47
>> *Objet :* Re: [OMPI users] Bad parallel scaling using Code Saturne
>> with openmpi
>> I suspect it mostly reflects communication patterns. I don't know
>> anything about Saturne, but shared memory is a great deal faster than
>> TCP, so the more processes sharing a node the better. You may also be
>> hitting some natural boundary in your model - perhaps with 8
>> processes/node you wind up with more processes that cross the node
>> boundary, further increasing the communication requirement.
>> Do things continue to get worse if you use all 4 nodes with 6
>> processes/node?
>> On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:
>>> Hi.
>>> I have recently built a cluster upon a Dell PowerEdge Server with a
>>> Debian 6.0 OS. This server is composed of
>>> 4 system board of 2 processors of hexacores. So it gives 12 cores per
>>> system board.
>>> The boards are linked with a local Gbits switch.
>>> In order to parallelize the software Code Saturne, which is a CFD
>>> solver, I have configured the cluster
>>> such that there are a pbs server/mom on 1 system board and 3 mom and
>>> the 3 others cards. So this leads to
>>> 48 cores dispatched on 4 nodes of 12 CPU. Code saturne is compiled
>>> with the openmpi 1.6 version.
>>> When I launch a simulation using 2 nodes with 12 cores, elapse time
>>> is good and network traffic is not full.
>>> But when I launch the same simulation using 3 nodes with 8 cores,
>>> elapse time is 5 times the previous one.
>>> I both cases, I use 24 cores and network seems not to be satured.
>>> I have tested several configurations : binaries in local file system
>>> or on a NFS. But results are the same.
>>> I have visited severals forums (in particular
>>> and read lots of threads, but as I am not an expert at clusters, I
>>> presently do not see where it is wrong !
>>> Is it a problem in the configuration of PBS (I have installed it from
>>> the deb packages), a subtile compilation options
>>> of openMPI, or a bad network configuration ?
>>> Regards.

Hi Albert

1) Have you tried to bind processes to cores:

mpiexec -bycore -bind-to-core -np ...

Sometimes it improves performance.

2) Packing as much work as possible in the fewest nodes
[and reducing MPI communication to mostly shared memory]
always helps, as Ralph said.

3) Infinband low latency and bandwith helps, as David said.
It will cost you a switch plus a HCA card adapter for each node.
This can give you an idea of prices:

4) Not all code scales well with the number of processors.
The norm is to be sub-linear, even when the problem is large enough
to avoid the 'too much communication per computation' situation that
David referred to.
Some have humps and bumps and sweet spots in the scaling curve,
which may depend on the specifics of domain dimensions, etc.

5) However, your CFD code may have some parameters/knobs to
control the way the mesh is decomposed, and how the subdomains
are distributed across the processors,
which may help you find the load balance for
each problem type and size. This comes part by trial-and-error,
part by educated guesses.

I hope this helps,
Gus Correa

PS - Are your processors Intel [Nehalem or later]?
Is hyperthreading turned on [on BIOS perhaps]?
This would give you 24 virtual cores per node.
With hyperthreading turned on
you may need to adjust your $TORQUE_PBS/server_priv/nodes
file to read something like:
node01 np=24
instead of np=12, and then restart the pbs_server.
It may be worth testing the code both ways,
as sometimes hyperthreading sucks,
sometimes it shines, depending on the code.
[Or better, it may not give you
twice the speed, but may give you 20-30% more speed,
which is not negligible.]