
Subject: Re: [OMPI users] Running on two nodes slower than running on one node
From: Tim Prince (n8tm_at_[hidden])
Date: 2014-01-29 09:03:55


On 1/29/2014 8:02 AM, Reuti wrote:
> Quoting Victor <victor.major_at_[hidden]>:
>
>> Thanks for the reply Reuti,
>>
>> There are two machines: Node1 with 12 physical cores (dual 6 core
>> Xeon) and
>
> Do you have this CPU?
>
> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>
>
> -- Reuti
>
On the 6-core Xeon Westmere CPUs it is expected to see MPI performance
saturate once all 4 of the internal bus paths are in use. For this
reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set
so that each MPI rank has its own internal CPU bus, could outperform
plain MPI on those CPUs.
That scheme of pairing cores on selected internal bus paths hasn't been
repeated since. Some influential customers learned to prefer the 4-core
version of that CPU, given a reluctance to adopt the MPI/OpenMP hybrid
with affinity.
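(As a sketch only, and assuming the solver were built with OpenMP support,
which the Palabos benchmark discussed here apparently is not: with the
Open MPI 1.6-series mpirun, "2 cores per rank with affinity" would be
expressed roughly as

    mpirun -np 6 -cpus-per-proc 2 --bind-to-core -x OMP_NUM_THREADS=2 ./solver

on a 12-core node, i.e. 6 ranks, each bound to 2 cores and running 2
OpenMP threads. Treat it as an illustration of the binding options, not a
drop-in command.)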
If you want to talk about "downright strange," start thinking about the
schemes to optimize performance of 8 threads with 2 threads assigned to
each internal CPU bus on that CPU model. Or your scheme of trying to
balance MPI performance between very different CPU models.
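(If someone does want to spread one job across such different boxes, the
coarse knob at the mpirun level is rank placement: assuming the solver
divides the lattice roughly evenly per rank, giving the stronger node more
slots in the hostfile shifts more of the domain onto it, for example

    Node1 slots=12
    Node2 slots=4

    mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 16 ./cavity3d 400

Whether that gains anything still depends on the Gigabit link, which the
numbers below suggest is the real limit.)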
Tim
>
>> Node2 with 4 physical cores (i5-2400).
>>
>> Regarding scaling on the single 12-core node: no, it is also not linear.
>> In fact it is downright strange. I do not remember the numbers right now,
>> but 10 jobs are faster than 11, and 12 are the fastest, with a peak
>> performance of approximately 66 Msu/s, which is also far from triple the
>> 4-core performance. This odd non-linear behaviour also happens at lower
>> job counts on that 12-core node. I understand the decrease in scaling
>> with increasing core count on the single node, as memory bandwidth is an
>> issue.
>>
>> On the 4-core machine the scaling is progressive, i.e. every additional
>> job brings an increase in performance. A single core delivers 8.1 Msu/s
>> while 4 cores deliver 30.8 Msu/s. This is almost linear.
>>
>> Since my original email I have also installed Open-MX and recompiled
>> OpenMPI to use it. This has resulted in approximately 10% better
>> performance using the existing GbE hardware.
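For reference, a minimal sketch of rebuilding Open MPI against Open-MX
(the install path is only an example):

    ./configure --with-mx=/opt/open-mx ...
    make && make install

At run time the MX transport is then picked with either the MX BTL
(--mca btl mx,sm,self) or the MX MTL (--mca pml cm --mca mtl mx); which
of the two is available depends on how the 1.6.5 build was configured.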
>>
>>
>> On 29 January 2014 19:40, Reuti <reuti_at_[hidden]> wrote:
>>
>>> Am 29.01.2014 um 03:00 schrieb Victor:
>>>
>>> > I am running a CFD simulation benchmark cavity3d available within
>>> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>>> >
>>> > It is a parallel-friendly lattice Boltzmann solver library.
>>> >
>>> > Palabos provides benchmark results for the cavity3d on several
>>> different
>>> platforms and variables here:
>>> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>>> >
>>> > The problem that I have is that the benchmark performance on my
>>> cluster
>>> does not scale even close to a linear scale.
>>> >
>>> > My cluster configuration:
>>> >
>>> > Node1: Dual Xeon 5560 48 Gb RAM
>>> > Node2: i5-2400 24 Gb RAM
>>> >
>>> > Gigabit ethernet connection on eth0
>>> >
>>> > OpenMPI 1.6.5 on Ubuntu 12.04.3
>>> >
>>> >
>>> > Hostfile:
>>> >
>>> > Node1 -slots=4 -max-slots=4
>>> > Node2 -slots=4 -max-slots=4
>>> >
>>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>>> >
>>> > Problem:
>>> >
>>> > cavity3d 400
>>> >
>>> > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
>>> second
>>> > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
>>> second
>>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega
>>> site
>>> updates per second
>>> >
>>> > I understand that there are latencies with GbE and that there is MPI
>>> overhead, but this performance scaling still seems very poor. Are my
>>> expectations of scaling naive, or is there actually something wrong and
>>> fixable that will improve the scaling? Optimistically I would like each
>>> node to add to the cluster performance, not slow it down.
>>> >
>>> > Things get even worse if I run an asymmetric number of MPI jobs on
>>> each node. For instance, running -np 12 on Node1
>>>
>>> Isn't this overloading the machine with only 8 real cores in total?
>>>
>>>
>>> > is significantly faster than running -np 16 across Node1 and
>>> Node2, thus
>>> adding Node2 actually slows down the performance.
>>>
>>> The i5-2400 has only 4 cores and no threads.
>>>
>>> How much data has to be exchanged between the processes depends on the
>>> algorithm, and this can indeed be worse across a network.
>>>
>>> Also: does the algorithm scale linearly when used on Node1 only with 8
>>> cores? When it's 35.7615 with 4 cores, what result do you get with 8
>>> cores on this machine?
>>>
>>> -- Reuti

-- 
Tim Prince