On Jan 29, 2014, at 7:56 PM, Victor <victor.major@gmail.com> wrote:

Thanks for the insights, Tim. I was aware that the CPUs will choke beyond a certain point; from memory, on my machine this happens with 5 concurrent MPI jobs with the benchmark that I am using.

My primary question was about scaling between the nodes. I was not getting close to double the performance when running MPI jobs across two 4-core nodes. It may be better now that I have Open-MX in place, but I have not repeated the benchmarks yet since I need to get one simulation job done asap.

Some of that may be due to the expected loss of performance when you switch from shared memory to an inter-node transport. While memory-path saturation is real, what you reported could be more consistent with that transition - i.e., it isn't unusual to see applications perform better when run on a single node, depending upon how they are written, up to a certain problem size (which your code may not be hitting).


Regarding your mention of setting affinities and MPI ranks: do you have a specific (as in syntactically specific, since I am a novice and easily confused...) example of how I might set affinities to get the Westmere node performing better?

mpirun --bind-to-core --cpus-per-rank 2 ...

will bind each MPI rank to 2 cores. Note that this will definitely *not* be a good idea if you are running more than two threads per process - if you are, then set --cpus-per-rank to the number of threads, keeping in mind that you want things to divide evenly across the sockets. In other words, if you have two 6-core Westmere sockets in the node, then you want to run either 6 processes with --cpus-per-rank 2 if each process runs 2 threads, 4 processes with --cpus-per-rank 3 if each process runs 3 threads, or 2 processes with --bind-to-socket instead of --bind-to-core (and no --cpus-per-rank) for any thread count greater than 3.

You would not want to run any other number of processes on the node or else the binding pattern will cause a single process to split its threads across the sockets - which will definitely hurt performance.
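
For a concrete example: if your Westmere node has two 6-core sockets and your code runs 2 OpenMP threads per rank, something along these lines should keep each rank's pair of threads together (the executable name is just a placeholder, and this assumes your build really does spawn OpenMP threads):

export OMP_NUM_THREADS=2
mpirun -np 6 --bind-to-core --cpus-per-rank 2 ./your_solver

If the code is pure MPI with one thread per rank, just use --bind-to-core by itself and skip --cpus-per-rank.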



ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
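
That looks fine - the 1.6 series uses hwloc for processor affinity, which is what you want. If you'd like to see where the ranks actually land, add --report-bindings to the mpirun line, e.g. (executable again a placeholder):

mpirun --report-bindings --bind-to-core -np 4 ./your_solver

and each process will report which cores it was bound to.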

And finally to hybridisation... in a week or so I will get 4 AMD A10-6800 machines with 8 GB each on loan and will attempt to make them work alongside the existing Intel nodes.

Victor


On 29 January 2014 22:03, Tim Prince <n8tm@aol.com> wrote:

On 1/29/2014 8:02 AM, Reuti wrote:
Quoting Victor <victor.major@gmail.com>:

Thanks for the reply Reuti,

There are two machines: Node1 with 12 physical cores (dual 6 core Xeon) and

Do you have this CPU?

http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI

-- Reuti

It's expected on the Xeon Westmere 6-core CPUs to see MPI performance saturate when all 4 of the internal bus paths are in use.  For this reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so that each MPI rank has its own internal CPU bus, could outperform plain MPI on those CPUs.
That scheme of pairing cores on selected internal bus paths hasn't been repeated.  Some influential customers learned to prefer the 4-core version of that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
If you want to talk about "downright strange," start thinking about the schemes to optimize performance of 8 threads with 2 threads assigned to each internal CPU bus on that CPU model.  Or about your scheme of trying to balance MPI performance between very different CPU models.
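
If you want to experiment with explicit pairings rather than the automatic --cpus-per-rank mapping, Open MPI also takes a rankfile; a rough sketch for a dual-socket 6-core node could look like the following (the hostname is a placeholder, and which core pairs actually share an internal bus path is something to confirm with hwloc's lstopo rather than take from this sketch):

rank 0=node1 slot=0:0,1
rank 1=node1 slot=0:2,3
rank 2=node1 slot=0:4,5
rank 3=node1 slot=1:0,1
rank 4=node1 slot=1:2,3
rank 5=node1 slot=1:4,5

launched with something like "mpirun -np 6 -rf ./myrankfile ./your_solver", again with 2 threads per rank.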
Tim


Node2 with 4 physical cores (i5-2400).

Regarding scaling on the single 12-core node, no, it is also not linear. In fact it is downright strange. I do not remember the numbers right now, but 10 jobs are faster than 11, and 12 is the fastest, with a peak performance of approximately 66 Msu/s, which is also far from triple the 4-core performance. This odd non-linear behaviour also happens at lower job counts on that 12-core node. I understand the decrease in scaling with increasing core count on the single node, as memory bandwidth is an issue.

On the 4-core machine the scaling is progressive, i.e. every additional job brings an increase in performance. A single core delivers 8.1 Msu/s while 4 cores deliver 30.8 Msu/s. This is almost linear.

Since my original email I have also installed Open-MX and recompiled
OpenMPI to use it. This has resulted in approximately 10% better
performance using the existing GbE hardware.
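
For anyone wanting to try the same thing: the general recipe is to install Open-MX for your NIC, rebuild Open MPI with MX support pointed at that install, and then either let it pick the mx BTL automatically or force it on the command line - roughly along these lines, with the install path obviously depending on your own layout:

./configure --with-mx=/opt/open-mx
mpirun --mca btl mx,sm,self ...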


On 29 January 2014 19:40, Reuti <reuti@staff.uni-marburg.de> wrote:

Am 29.01.2014 um 03:00 schrieb Victor:

> I am running a CFD simulation benchmark, cavity3d, available within
> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>
> It is a parallel-friendly lattice Boltzmann solver library.
>
> Palabos provides benchmark results for cavity3d on several different
> platforms and variables here:
> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>
> The problem that I have is that the benchmark performance on my cluster
> does not scale even close to linearly.
>
> My cluster configuration:
>
> Node1: Dual Xeon 5560, 48 GB RAM
> Node2: i5-2400, 24 GB RAM
>
> Gigabit ethernet connection on eth0
>
> OpenMPI 1.6.5 on Ubuntu 12.04.3
>
>
> Hostfile:
>
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
>
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>
> Problem:
>
> cavity3d 400
>
> When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
> second
> When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
> second
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site
> updates per second
>
> I understand that there are latencies with GbE and that there is MPI
> overhead, but this performance scaling still seems very poor. Are my
> expectations of scaling naive, or is there actually something wrong and
> fixable that will improve the scaling? Optimistically I would like each
> node to add to the cluster performance, not slow it down.
>
> Things get even worse if I run an asymmetric number of MPI jobs on each
> node. For instance, running -np 12 on Node1

Isn't this overloading the machine with only 8 real cores in total?


> is significantly faster than running -np 16 across Node1 and Node2, thus
> adding Node2 actually slows down the performance.

The i5-2400 has only 4 cores and no Hyper-Threading.

How much data has to be exchanged between the processes depends on the algorithm, and this can indeed be worse across a network.

Also: does the algorithm scale linearly when used on node1 only, with 8 cores? When it's 35.7615 with 4 cores, what result do you get with 8 cores on this machine?

-- Reuti

--
Tim Prince



_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users