Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running on two nodes slower than running on one node
From: Victor (victor.major_at_[hidden])
Date: 2014-01-29 22:56:39


Thanks for the insights, Tim. I was aware that the CPUs choke beyond a
certain point. From memory, on my machine this happens at 5 concurrent MPI
jobs with the benchmark I am using.

My primary question was about scaling between the nodes. I was not getting
close to double the performance when running MPI jobs across two 4-core
nodes. It may be better now that I have Open-MX in place, but I have not
repeated the benchmarks yet since I need to get one simulation job done
asap.
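
To put a number on "not close to double", here is a minimal sketch that
computes the scaling efficiency from the figures in my original post (quoted
further down). These figures predate the Open-MX install, and the helper name
scaling_efficiency is just for illustration.

```python
# Throughput from the original post, in Mega site updates per second (Msu/s).
# All three measurements were taken before Open-MX was installed.
node1_np4 = 35.7615   # mpirun -np 4 on Node1 only
node2_np4 = 30.7972   # mpirun -np 4 on Node2 only
both_np8 = 47.3538    # mpirun -np 8 across both nodes over GbE


def scaling_efficiency(measured, standalone):
    """Measured throughput divided by the sum of the standalone throughputs."""
    return measured / sum(standalone)


# Ideal case: the two nodes simply add up.
ideal = node1_np4 + node2_np4
eff = scaling_efficiency(both_np8, [node1_np4, node2_np4])
speedup_vs_node1 = both_np8 / node1_np4

print(f"ideal combined throughput: {ideal:.2f} Msu/s")
print(f"scaling efficiency:        {eff:.1%}")
print(f"speedup over Node1 alone:  {speedup_vs_node1:.2f}x")
```

So the two-node run delivers roughly 71% of the summed standalone throughput,
and only about 1.3 times what Node1 manages alone with 4 ranks.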

Regarding your mention of setting affinities and MPI ranks, do you have
specific examples (as in syntactically specific, since I am a novice and
easily confused...) of how I might set affinities to get the Westmere node
performing better?
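
As a rough illustration of the layout in question (2 cores per MPI rank on
the 12 core node), the sketch below only computes which core IDs each rank
would be given. Pairing consecutive core IDs is an assumption: which cores
actually share an internal bus depends on the hardware numbering and would
need checking, for example with hwloc's lstopo. The function name group_cores
is hypothetical, and nothing here performs any binding.

```python
def group_cores(total_cores, group_size):
    """Split core IDs 0..total_cores-1 into consecutive groups, one per MPI rank.

    Assumption: consecutive core IDs belong together. Real hardware numbering
    may differ, so treat the result as a layout to verify, not a fact.
    """
    if total_cores % group_size != 0:
        raise ValueError("total_cores must be a multiple of group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, total_cores, group_size)
    ]


# Node1 as described in this thread: 12 physical cores, 2 cores per rank.
layout = group_cores(12, 2)
for rank, cores in enumerate(layout):
    print(f"rank {rank}: cores {cores}")
```

This gives 6 ranks with 2 cores each, which is the shape a hybrid MPI/OpenMP
run with 2 threads per rank would need.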

ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, Component
v1.6.5)

And finally to hybridisation... in a week or so I will get 4 AMD A10-6800
machines with 8 GB each on loan and will attempt to make them work alongside
the existing Intel nodes.

Victor

On 29 January 2014 22:03, Tim Prince <n8tm_at_[hidden]> wrote:

>
> On 1/29/2014 8:02 AM, Reuti wrote:
>
>> Quoting Victor <victor.major_at_[hidden]>:
>>
>> Thanks for the reply Reuti,
>>>
>>> There are two machines: Node1 with 12 physical cores (dual 6 core Xeon)
>>> and
>>>
>>
>> Do you have this CPU?
>>
>> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>>
>> -- Reuti
>>
> It's expected on the Xeon Westmere 6-core CPUs to see MPI performance
> saturating when all 4 of the internal bus paths are in use. For this
> reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so
> that each MPI rank has its own internal CPU bus, could out-perform plain
> MPI on those CPUs.
> That scheme of pairing cores on selected internal bus paths hasn't been
> repeated. Some influential customers learned to prefer the 4-core version
> of that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
> If you want to talk about "downright strange," start thinking about the
> schemes to optimize performance of 8 threads with 2 threads assigned to
> each internal CPU bus on that CPU model. Or your scheme of trying to
> balance MPI performance between very different CPU models.
> Tim
>
>
>> Node2 with 4 physical cores (i5-2400).
>>>
>>> Regarding scaling on the single 12 core node: no, it is also not linear.
>>> In fact it is downright strange. I do not remember the numbers right now,
>>> but 10 jobs are faster than 11, and 12 are the fastest, with peak
>>> performance of approximately 66 Msu/s, which is also far from triple the
>>> 4 core performance. This odd non-linear behaviour also happens at the
>>> lower job counts on that 12 core node. I understand the decrease in
>>> scaling with increasing core count on a single node, as memory bandwidth
>>> is an issue.
>>>
>>> On the 4 core machine the scaling is progressive, i.e. every additional
>>> job brings an increase in performance. A single core delivers 8.1 Msu/s
>>> while 4 cores deliver 30.8 Msu/s. This is almost linear.
>>>
>>> Since my original email I have also installed Open-MX and recompiled
>>> OpenMPI to use it. This has resulted in approximately 10% better
>>> performance using the existing GbE hardware.
>>>
>>>
>>> On 29 January 2014 19:40, Reuti <reuti_at_[hidden]> wrote:
>>>
>>> Am 29.01.2014 um 03:00 schrieb Victor:
>>>>
>>>> > I am running a CFD simulation benchmark cavity3d available within
>>>> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>>>> >
>>>> > It is a parallel-friendly Lattice Boltzmann solver library.
>>>> >
>>>> > Palabos provides benchmark results for the cavity3d on several
>>>> different
>>>> platforms and variables here:
>>>> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>>>> >
>>>> > The problem that I have is that the benchmark performance on my
>>>> cluster
>>>> does not scale even close to a linear scale.
>>>> >
>>>> > My cluster configuration:
>>>> >
>>>> > Node1: Dual Xeon 5560 48 GB RAM
>>>> > Node2: i5-2400 24 GB RAM
>>>> >
>>>> > Gigabit ethernet connection on eth0
>>>> >
>>>> > OpenMPI 1.6.5 on Ubuntu 12.04.3
>>>> >
>>>> >
>>>> > Hostfile:
>>>> >
>>>> > Node1 -slots=4 -max-slots=4
>>>> > Node2 -slots=4 -max-slots=4
>>>> >
>>>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
>>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>>>> >
>>>> > Problem:
>>>> >
>>>> > cavity3d 400
>>>> >
>>>> > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
>>>> second
>>>> > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
>>>> second
>>>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
>>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site
>>>> updates per second
>>>> >
>>>> > I understand that there are latencies with GbE and that there is MPI
>>>> overhead, but this performance scaling still seems very poor. Are my
>>>> expectations of scaling naive, or is there actually something wrong and
>>>> fixable that will improve the scaling? Optimistically I would like each
>>>> node to add to the cluster performance, not slow it down.
>>>> >
>>>> > Things get even worse if I run asymmetric number of mpi jobs in each
>>>> node. For instance running -np 12 on Node1
>>>>
>>>> Isn't this overloading the machine with only 8 real cores in total?
>>>>
>>>>
>>>> > is significantly faster than running -np 16 across Node1 and Node2,
>>>> thus
>>>> adding Node2 actually slows down the performance.
>>>>
>>>> The i5-2400 has only 4 cores and no threads.
>>>>
>>>> It depends on the algorithm how much data has to be exchanged between
>>>> the
>>>> processes, and this can indeed be worse when used across a network.
>>>>
>>>> Also: does the algorithm scale linearly when used on node1 only with 8
>>>> cores? When it's "35.7615" with 4 cores, what result do you get with 8
>>>> cores on this machine?
>>>>
>>>> -- Reuti
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>
>>
>
> --
> Tim Prince
>
>