Quoting Victor <victor.major_at_[hidden]>:
> Thanks for the reply Reuti,
> There are two machines: Node1 with 12 physical cores (dual 6-core Xeon) and
Do you have this CPU?
> Node2 with 4 physical cores (i5-2400).
> Regarding scaling on the single 12-core node, no, it is also not linear. In
> fact it is downright strange. I do not remember the numbers right now, but
> 10 jobs are faster than 11, and 12 jobs are the fastest, with a peak performance
> of approximately 66 Msu/s, which is also far from triple the 4-core
> performance. This odd non-linear behaviour also happens at the lower job
> counts on that 12-core node. I understand the decrease in scaling with
> increasing core count on the single node, as the memory bandwidth is an issue.
> On the 4-core machine the scaling is progressive, i.e. every additional job
> brings an increase in performance. Single core delivers 8.1 Msu/s while 4
> cores deliver 30.8 Msu/s. This is almost linear.
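As a quick sanity check of "almost linear": 30.8 / (4 x 8.1) = 0.95, i.e. roughly
95% parallel efficiency on the i5-2400, so intra-node scaling on that box does
look healthy.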
> Since my original email I have also installed Open-MX and recompiled
> OpenMPI to use it. This has resulted in approximately 10% better
> performance using the existing GbE hardware.
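For anyone wanting to reproduce that Open-MX rebuild, a minimal sketch of the
steps described above (the install prefix and the exact BTL selection are
assumptions, not details taken from Victor's setup):

    # build Open MPI against the Open-MX (MX-compatible) stack; prefix is assumed
    ./configure --with-mx=/opt/open-mx --prefix=/opt/openmpi-1.6.5-mx
    make -j4 && make install

    # at run time, ask for the mx BTL instead of plain tcp
    mpirun --mca btl mx,sm,self --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400

Running ompi_info | grep mx afterwards is a quick way to confirm that the mx
component was actually built in.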
> On 29 January 2014 19:40, Reuti <reuti_at_[hidden]> wrote:
>> On 29.01.2014 at 03:00, Victor wrote:
>> > I am running a CFD simulation benchmark, cavity3d, available within Palabos.
>> > It is a parallel-friendly Lattice Boltzmann solver library.
>> > Palabos provides benchmark results for the cavity3d on several different
>> platforms and variables here:
>> > The problem that I have is that the benchmark performance on my cluster
>> does not scale anywhere close to linearly.
>> > My cluster configuration:
>> > Node1: Dual Xeon 5560, 48 GB RAM
>> > Node2: i5-2400, 24 GB RAM
>> > Gigabit Ethernet connection on eth0
>> > OpenMPI 1.6.5 on Ubuntu 12.04.3
>> > Hostfile:
>> > Node1 -slots=4 -max-slots=4
>> > Node2 -slots=4 -max-slots=4
>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
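As a side note, a quick way to see which transports a given run actually
selects is to raise the BTL framework verbosity (the level 30 below is just an
example value):

    mpirun --mca btl_base_verbose 30 --mca btl_tcp_if_include eth0 \
        --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400

This prints which BTL components each process opens, so you can confirm that
the eth0/tcp path is really the one in use.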
>> > Problem:
>> > cavity3d 400
>> > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per second.
>> > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per second.
>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site
>> updates per second
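Putting those numbers side by side: the two nodes individually deliver
35.76 + 30.80 = 66.56 Msu/s, so the 47.35 Msu/s from the combined 8-rank run
is roughly 71% of the ideal aggregate.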
>> > I understand that there are latencies with GbE and that there is MPI
>> overhead, but this performance scaling still seems very poor. Are my
>> expectations of scaling naive, or is there actually something wrong and
>> fixable that will improve the scaling? Optimistically I would like each
>> node to add to the cluster performance, not slow it down.
>> > Things get even worse if I run an asymmetric number of MPI jobs on each
>> node. For instance, running -np 12 on Node1
>> Isn't this overloading the machine with only 8 real cores in total?
>> > is significantly faster than running -np 16 across Node1 and Node2, thus
>> adding Node2 actually slows down the performance.
>> The i5-2400 has only 4 cores and no Hyper-Threading.
>> How much data has to be exchanged between the processes depends on the
>> algorithm, and this can indeed be worse when it happens across a network.
>> Also: is the algorithm scaling linearly when used on Node1 only, with 8
>> cores? When it's "35.7615" with 4 cores, what result do you get with 8
>> cores on this machine?
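One way to run that check, pinning the ranks so they cannot drift between the
two sockets (the binding flags below exist in the 1.6-series mpirun, but treat
the exact invocation as a sketch), is to log in on Node1 and run:

    mpirun -np 8 --bind-to-core --report-bindings ./cavity3d 400

--report-bindings shows which core each rank lands on, which also makes any
accidental oversubscription visible.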
>> -- Reuti