Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
From: Yvan Fournier (yvan.fournier_at_[hidden])
Date: 2012-07-11 21:20:37


On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:
>
> Hi.
>
> I recently built a cluster on a Dell PowerEdge server running Debian 6.0. The server is composed of
> 4 system boards, each with 2 hexa-core processors, which gives 12 cores per system board.
> The boards are linked by a local Gbit switch.
>
> To run the CFD solver Code_Saturne in parallel, I configured the cluster
> with a PBS server/mom on 1 system board and 3 moms on the 3 other boards. This gives
> 48 cores spread over 4 nodes of 12 cores each. Code_Saturne is compiled with Open MPI 1.6.
>
> When I launch a simulation on 2 nodes with 12 cores each, the elapsed time is good and the network is not saturated.
> But when I launch the same simulation on 3 nodes with 8 cores each, the elapsed time is 5 times longer.
> In both cases, I use 24 cores and the network does not seem to be saturated.
>
> I have tested several configurations: binaries on a local file system or on NFS. The results are the same.
> I have visited several forums (in particular http://www.open-mpi.org/community/lists/users/2009/08/10394.php)
> and read lots of threads, but as I am not an expert on clusters, I currently do not see what is wrong!
>
> Is it a problem with the PBS configuration (I installed it from the deb packages), a subtle compilation option
> of Open MPI, or a bad network configuration?
>
> Regards.
>
> B. S.
> ________________________________

Hello,

I am a Code_Saturne developer, so I can confirm a few comments from
others on this list:

- Most of the code's communication is latency-bound: we use
iterative linear solvers, which make heavy use of MPI_Allreduce, with
only 1 to 3 double precision values per reduction. I do not know whether
modern "fast" Ethernet variants on a small number of switches make a big
difference, but tests made a few years ago on a cluster using a SCALI
network (fast/low latency at the time) led to the conclusion that the
code's performance was divided by 2 on an Ethernet network. These tests
need to be updated, but your results seem consistent.

- Actually, on an Infiniband cluster using Open MPI 1.4.3 (such as the
one described here: http://i.top500.org/system/177030), performance
tends to be better in some cases when spreading a constant number of
cores over more nodes, as the code is quite memory-bandwidth intensive.
Depending on the data size on each node, the gain may be significant or
only minor.
The network topology may also affect performance (tests using SLURM's
--switches option confirm this), as may binding processes to
cores.

- In recent years, the code has been used and tested mainly on
workstations (shared memory), Infiniband clusters, IBM Blue Gene (L,
P, and Q) systems, and Cray XT (5 and 6) then XE-6 machines. I am
interested in trying to improve performance on Ethernet
clusters, and I may have a few suggestions for options you can test, but
this conversation should probably move to the Code_Saturne forum
(http://code-saturne.org), as we will go into some options of our linear
solvers which are specific to that code, not to Open MPI.
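
As a rough illustration of the latency-bound point above, here is a minimal
sketch in Python of the classic latency/bandwidth ("alpha-beta") cost model
applied to a reduction of 1 to 3 double precision values. It is not taken
from Code_Saturne or Open MPI: the function name allreduce_time, the
assumption of a recursive-doubling style algorithm with ceil(log2(P)) steps,
and the 50 microsecond per-message latency are all hypothetical choices made
for the example. Only the 1 Gbit/s link speed comes from the cluster
description.

```python
def allreduce_time(n_values, n_procs, latency_s, bandwidth_bytes_per_s):
    """Estimate the cost of one allreduce with the alpha-beta model.

    Assumption: a recursive-doubling style algorithm taking
    ceil(log2(n_procs)) steps, each paying one message latency plus
    the transfer time of the full payload.
    Returns (latency_part, transfer_part) in seconds.
    """
    n_bytes = 8 * n_values                 # double precision values
    steps = (n_procs - 1).bit_length()     # ceil(log2(n_procs)) for n_procs >= 1
    latency_part = steps * latency_s
    transfer_part = steps * n_bytes / bandwidth_bytes_per_s
    return latency_part, transfer_part


# Illustrative assumptions, not measurements:
GBIT_BANDWIDTH = 125e6    # 1 Gbit/s expressed in bytes per second
ASSUMED_LATENCY = 50e-6   # hypothetical per-message latency on Ethernet

for n_values in (1, 3):
    lat, xfer = allreduce_time(n_values, 24, ASSUMED_LATENCY, GBIT_BANDWIDTH)
    share = lat / (lat + xfer)
    print(f"{n_values} value(s) on 24 ranks: latency share = {share:.4f}")
```

Under these assumptions the transfer term is negligible for such small
payloads, so a faster link would change little while a lower per-message
latency would help directly. The model ignores the difference between
intra-node and inter-node messages, so it should be read as an
order-of-magnitude argument only.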

Best regards,

  Yvan Fournier