Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
From: Gus Correa (gus_at_[hidden])
Date: 2012-07-11 16:55:06

Hi Dugenoux

On 07/11/2012 04:21 PM, Dugenoux Albert wrote:
> Hi.
> To answer the different remarks:
> 1) Code Saturne launches itself through embedded Python and bash scripts with the
> mpiexec parameters, but I will test
> your parameter next week and will give you the result of this benchmark.

Yes, I have seen other programs that use launch wrappers as well.
They can make it more difficult to change the mpiexec command line
options directly. But maybe there is a way to pass mpiexec parameters
as a string or similar.
Our cluster has AMD Opterons, and sometimes
'--bysocket --bind-to-socket' helps.
However, although it may be worth trying,
this may or may not be beneficial on Intel processors;
I haven't used them for MPI in a while.
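
If the wrapper accepts the extra mpiexec options as a single string, the
idea could be sketched roughly as below. This is only an illustration:
build_mpiexec_command, ./my_solver and the process count are hypothetical
names and values, not part of Code Saturne. Only the two binding flags come
from the suggestion above.

```python
import shlex


def build_mpiexec_command(nprocs, program, extra_options=""):
    """Return the mpiexec argument list, splicing in user-supplied options.

    extra_options is a plain string, as a wrapper script might accept it.
    """
    cmd = ["mpiexec", "-np", str(nprocs)]
    cmd += shlex.split(extra_options)  # honours shell-style quoting
    cmd.append(program)
    return cmd


# Hypothetical usage: 24 processes with the socket mapping/binding flags.
cmd = build_mpiexec_command(24, "./my_solver", "--bysocket --bind-to-socket")
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run() unchanged, which avoids
re-quoting problems when the options pass through several script layers.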

> 2) I do not think there is a problem with the load balancing: Code
> Saturne partitions the mesh itself
> with the reliable and well-known Metis library, which is the
> graph partitioner. So the CPUs
> are equally busy.

Out of mere curiosity, yesterday I briefly checked the Code Saturne
user guide, and indeed I saw that it uses Metis for graph partitioning.
Not being familiar with engineering CFD codes,
irregular meshes, adaptive meshes, or Metis, I still wonder
if Metis [and Code Saturne] may accept some parameters
to help it do its partitioning job.
[Here we do Earth Science type of CFD, a.k.a. GFD or geophysical fluid
dynamics, mostly atmosphere, ocean and climate dynamics,
but sometimes also solid earth (mantle convection, volcanism, etc.).
However, the Navier-Stokes equations are reduced by a number of
approximations: Boussinesq, sometimes hydrostatic, turbulence
is parametrized, small wave numbers are damped, grids/meshes
are typically fixed and simple, etc.]

> 3) The CPUs are Xeons, which have multithreading capabilities. However, I have
> tested it
> by setting np=24 in the server_priv/nodes file of the PBS server, and
> compared that
> with a configuration of np=12. The results are very similar: there is
> no gain of 20% or 30%

I just wonder if hyperthreading was really turned on during your tests.
It often comes enabled in the computer BIOS, but sometimes
it is turned off.
Did you check that?
It is not enough to have Intel processors; hyperthreading must be
enabled, and this is usually done in the BIOS,
which you can access and change during bootup [pressing F1, F2 or DEL,
depending on the BIOS manufacturer].
Hyperthreading won't help with the cost of MPI communication,
but it may speed up the computation part of the algorithm a bit.
However, it may well be that Gigabit Ethernet is killing the
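
On the hyperthreading question, one way to check without rebooting into the
BIOS, assuming an x86 Linux node, is to compare the "siblings" and
"cpu cores" fields of /proc/cpuinfo: more siblings than cores per package
suggests hyperthreading is active. The helper name below is hypothetical and
the sketch only inspects the first processor entry.

```python
def hyperthreading_enabled(cpuinfo_text):
    """Guess whether hyperthreading is active from /proc/cpuinfo contents.

    Compares 'siblings' (logical CPUs per package) with 'cpu cores'
    (physical cores per package), using the first occurrence of each field.
    Returns None if either field is missing or not an integer.
    """
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields.setdefault(key.strip(), value.strip())  # keep first occurrence
    try:
        return int(fields["siblings"]) > int(fields["cpu cores"])
    except (KeyError, ValueError):
        return None


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("hyperthreading enabled:", hyperthreading_enabled(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

Running it on each compute node, not just the head node, would show whether
the np=24 test really had 24 logical CPUs per board to work with.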

Have you tried to run on a single node, avoiding any
network communication cost, just for kicks?
That may give you a baseline to compare against runs across nodes.
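
Once you have that baseline, the comparison is simple arithmetic. A small
sketch follows; the function name and the timings are made up for
illustration and are not measurements from your cluster.

```python
def speedup_and_efficiency(t_baseline, n_baseline, t_run, n_run):
    """Compare a run against a baseline run.

    t_* are elapsed wall-clock times, n_* are the core counts used.
    Speedup is t_baseline / t_run; efficiency rescales it by the
    ratio of core counts, so 1.0 means perfect scaling.
    """
    speedup = t_baseline / t_run
    efficiency = speedup * n_baseline / n_run
    return speedup, efficiency


# Hypothetical timings in seconds: 12 cores on one node vs 24 cores on two.
s, e = speedup_and_efficiency(1000.0, 12, 600.0, 24)
print(f"speedup {s:.2f}, parallel efficiency {e:.0%}")
```

An efficiency well below 1.0 as soon as the run leaves a single node would
point at the interconnect rather than at the partitioning.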

> 4) I will examine the hardware options as you have suggested, but I will
> have to convince my
> office to make such an investment!

I know that problem.
I am always told to 'get the most bang for the buck',
and I am often given the least buck and required to produce a big bang!
Rest assured that I hold no stock in that Internet vendor Colfax,
or in manufacturers of InfiniBand products [Mellanox,
QLogic, now Intel].
That was just meant to give you an idea of prices.
In France you probably have different vendors, but I believe the
manufacturers are pretty much the same.
BTW, I forgot to mention that besides the switch and HCAs,
you also need to buy the InfiniBand cables! :)

I hope that helps,
Gus Correa

> ------------------------------------------------------------------------
> *From:* Gus Correa <gus_at_[hidden]>
> *To:* Open MPI Users <users_at_[hidden]>
> *Sent:* Wednesday, July 11, 2012 0:51
> *Subject:* Re: [OMPI users] Bad parallel scaling using Code Saturne with
> openmpi
> On 07/10/2012 05:31 PM, Jeff Squyres wrote:
> > +1. Also, not all Ethernet switches are created equal --
> > particularly commodity 1Gb Ethernet switches.
> > I've seen plenty of crappy Ethernet switches rated for 1Gb
> > that could not reach that speed when under load.
> >
> Are you perhaps belittling my dear $43 [brand undisclosed]
> 5-port GigE SoHo switch, that connects my Pentium-III
> toy cluster, just because it drops a few packets [per microsec]?
> It looks so good, with all those fiercely blinking green LEDs.
> Where else could I fool around with cluster setup and test
> the new Open MPI releases? :)
> The production cluster is just too crowded for this,
> maybe because it has a decent
> HP GigE switch [IO] and Infiniband [MPI] ...
> Gus
> >
> >
> > On Jul 10, 2012, at 10:47 AM, Ralph Castain wrote:
> >
> >> I suspect it mostly reflects communication patterns. I don't know
> anything about Saturne, but shared memory is a great deal faster than
> TCP, so the more processes sharing a node the better. You may also be
> hitting some natural boundary in your model - perhaps with 8
> processes/node you wind up with more processes that cross the node
> boundary, further increasing the communication requirement.
> >>
> >> Do things continue to get worse if you use all 4 nodes with 6
> processes/node?
> >>
> >>
> >> On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:
> >>
> >>> Hi.
> >>>
> >>> I have recently built a cluster upon a Dell PowerEdge Server with a
> Debian 6.0 OS. This server is composed of
> >>> 4 system boards, each with 2 hexacore processors. So it gives 12 cores
> per system board.
> >>> The boards are linked with a local Gigabit switch.
> >>>
> >>> In order to parallelize the software Code Saturne, which is a CFD
> solver, I have configured the cluster
> >>> such that there is a PBS server/mom on 1 system board and 3 moms
> on the 3 other boards. So this leads to
> >>> 48 cores distributed over 4 nodes of 12 cores each. Code Saturne is compiled
> with Open MPI version 1.6.
> >>>
> >>> When I launch a simulation using 2 nodes with 12 cores each, the elapsed time
> is good and the network is not saturated.
> >>> But when I launch the same simulation using 3 nodes with 8 cores each, the
> elapsed time is 5 times the previous one.
> >>> In both cases, I use 24 cores and the network does not seem to be saturated.
> >>>
> >>> I have tested several configurations: binaries on a local file
> system or on NFS. But the results are the same.
> >>> I have visited several forums (in particular
> >>> and read lots of threads, but as I am not an expert at clusters, I
> presently do not see where it is wrong!
> >>>
> >>> Is it a problem in the configuration of PBS (I have installed it
> from the deb packages), a subtle compilation option
> >>> of Open MPI, or a bad network configuration?
> >>>
> >>> Regards.
> >>>
> >>> B. S.
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden] <mailto:users_at_[hidden]>
> >>>
> >>
> >
> >