Your problem may not be related to bandwidth. It may be latency or how the problem is divided. We found significant improvements running WRF and other atmospheric (CFD) codes over IB. The problem was not so much the amount of data communicated as how long it takes to send it.

Also, is your model big enough to split up as much as you are trying? If there is not enough work for each core to do between edge exchanges, you will spend all of your time spinning, waiting for the network. If you are running a demo benchmark, it is likely too small for the number of processors. At least that is what we find with most weather model demo problems.

One other thing to look at is how the domain is being split up. Depending on what the algorithm does, you may want more x points, more y points, or completely even divisions. We found that we can significantly speed up WRF for our particular domain by not lett
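
To make the latency and problem-size points concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from WRF or Code Saturne: it assumes a regular 2D grid split into px by py rectangular subdomains with one halo exchange per edge per time step, and the default numbers (time per grid point, 50 microsecond latency, 100 MB/s bandwidth) are illustrative round figures for Gigabit Ethernet, not measurements. The function name and all parameters are hypothetical.

```python
def step_times(nx, ny, px, py, t_point=1e-7, latency=50e-6,
               bandwidth=100e6, bytes_per_point=8):
    """Estimate (compute_s, comm_s) per time step for one interior subdomain.

    All defaults are assumed, illustrative values:
      t_point   - seconds of work per grid point per step
      latency   - seconds of fixed cost per message
      bandwidth - bytes per second on the link
    """
    sub_x, sub_y = nx / px, ny / py
    compute = sub_x * sub_y * t_point
    # Four edge exchanges per step: two edges of length sub_x, two of sub_y.
    halo_bytes = 2 * (sub_x + sub_y) * bytes_per_point
    comm = 4 * latency + halo_bytes / bandwidth
    return compute, comm


if __name__ == "__main__":
    # Same 24 ranks, small demo-sized grid versus a larger grid,
    # and three different ways of splitting the domain.
    for nx, ny in [(200, 200), (2000, 2000)]:
        for px, py in [(24, 1), (4, 6), (1, 24)]:
            compute, comm = step_times(nx, ny, px, py)
            print(f"grid {nx}x{ny} split {px}x{py}: "
                  f"compute {compute:.2e} s, comm {comm:.2e} s")
```

With these assumed numbers the fixed per-message latency, not the volume of halo data, is the larger part of the communication time, and the small grid spends longer exchanging edges than computing. Real codes overlap communication and use different partitioning, so treat this only as a way to reason about orders of magnitude.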

On 07/10/12 08:48, Dugenoux Albert wrote:
Thanks for your answer. You are right.
I tried 4 nodes with 6 processes each, and things are worse.
So do you suggest that the only thing to do is to order an InfiniBand switch, or is it possible to improve
things by tuning MCA parameters?

From: Ralph Castain <>
To: Dugenoux Albert <>; Open MPI Users <>
Sent: Tuesday, 10 July 2012, 16:47
Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi

I suspect it mostly reflects communication patterns. I don't know anything about Code Saturne, but shared memory is a great deal faster than TCP, so the more processes sharing a node, the better. You may also be hitting some natural boundary in your model: perhaps with 8 processes/node you wind up with more processes that communicate across a node boundary, further increasing the communication requirement.

Do things continue to get worse if you use all 4 nodes with 6 processes/node?
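
The node-boundary effect can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not a description of how Code Saturne partitions its mesh (the thread does not say): the 24 ranks are assumed to form a 4 x 6 grid with nearest-neighbor exchanges, and ranks are assumed to fill one node completely before moving to the next. The function name is hypothetical.

```python
def cross_node_pairs(px, py, ranks_per_node):
    """Count neighbor pairs in a px-by-py rank grid that sit on different nodes.

    Assumes row-major rank numbering (rank = i * py + j) and that ranks
    are placed on nodes in consecutive blocks of ranks_per_node.
    """
    def node(i, j):
        return (i * py + j) // ranks_per_node

    crossing = 0
    for i in range(px):
        for j in range(py):
            # Neighbor in the next row.
            if i + 1 < px and node(i, j) != node(i + 1, j):
                crossing += 1
            # Neighbor in the next column.
            if j + 1 < py and node(i, j) != node(i, j + 1):
                crossing += 1
    return crossing


if __name__ == "__main__":
    for nodes, per_node in [(2, 12), (3, 8), (4, 6)]:
        n = cross_node_pairs(4, 6, per_node)
        print(f"{nodes} nodes x {per_node} ranks: {n} neighbor pairs cross the network")
```

In this toy layout, 12 ranks per node lines up with whole rows of the rank grid, while 8 per node cuts through rows, so more neighbor pairs have to talk over TCP instead of shared memory. Whether the real run behaves this way depends on the actual partitioning and rank placement.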

On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:

I have recently built a cluster on a Dell PowerEdge server running Debian 6.0. The server is composed of
4 system boards, each with 2 hexa-core processors, which gives 12 cores per system board.
The boards are linked by a local Gigabit switch.
In order to parallelize Code Saturne, which is a CFD solver, I have configured the cluster
so that there is a PBS server/mom on 1 system board and a mom on each of the 3 other boards. This leads to
48 cores spread over 4 nodes of 12 cores each. Code Saturne is compiled with Open MPI 1.6.
When I launch a simulation using 2 nodes with 12 cores each, the elapsed time is good and the network is not saturated.
But when I launch the same simulation using 3 nodes with 8 cores each, the elapsed time is 5 times the previous one.
In both cases I use 24 cores, and the network does not seem to be saturated.
I have tested several configurations, with binaries on the local file system or on NFS, but the results are the same.
I have visited several forums (in particular
and read lots of threads, but as I am not an expert on clusters, I presently do not see what is wrong!
Is it a problem in the configuration of PBS (I installed it from the deb packages), a subtle compilation option
of Open MPI, or a bad network configuration?
B. S.
users mailing list