
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Running on two nodes slower than running on one node
From: Reuti (reuti_at_[hidden])
Date: 2014-01-29 06:40:12


On 29.01.2014 at 03:00, Victor wrote:

> I am running the CFD simulation benchmark cavity3d, available in http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>
> It is a parallel-friendly lattice Boltzmann solver library.
>
> Palabos provides benchmark results for the cavity3d on several different platforms and variables here: http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>
> The problem I have is that the benchmark performance on my cluster does not scale anywhere close to linearly.
>
> My cluster configuration:
>
> Node1: Dual Xeon 5560 48 GB RAM
> Node2: i5-2400 24 GB RAM
>
> Gigabit ethernet connection on eth0
>
> OpenMPI 1.6.5 on Ubuntu 12.04.3
>
>
> Hostfile:
>
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
>
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>
> Problem:
>
> cavity3d 400
>
> When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per second
> When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per second
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site updates per second
>
> I understand that there are latencies with GbE and that there is MPI overhead, but this performance scaling still seems very poor. Are my expectations of scaling naive, or is there actually something wrong and fixable that will improve the scaling? Optimistically I would like each node to add to the cluster performance, not slow it down.
>
> Things get even worse if I run an asymmetric number of MPI processes on each node. For instance running -np 12 on Node1

Doesn't this overload the machine, which has only 8 physical cores in total?

> is significantly faster than running -np 16 across Node1 and Node2, thus adding Node2 actually slows down the performance.

The i5-2400 has only 4 cores and no Hyper-Threading.

How much data has to be exchanged between the processes depends on the algorithm, and that exchange can indeed be much slower across a network than within a single node.
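
To get a feeling for the order of magnitude, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration, not something taken from Palabos: that the 400^3 domain is cut so that one full 400 x 400 face crosses the network, that all 19 double-precision values of a D3Q19 lattice site are sent for each boundary site, and that GbE delivers its theoretical 125 MB/s. The helper names are hypothetical.

```python
# Rough estimate of per-step boundary traffic over the network link.
# ASSUMPTIONS (not from the original post): one full n x n face crosses the
# link, 19 values per site (D3Q19), 8-byte doubles, 125 MB/s GbE peak rate.

def halo_bytes_per_step(n, values_per_site=19, bytes_per_value=8):
    """Bytes sent in ONE direction across one n x n face per time step."""
    return n * n * values_per_site * bytes_per_value

def transfer_seconds(num_bytes, link_bytes_per_s=125e6):
    """Time to push num_bytes through the link, ignoring latency and overhead."""
    return num_bytes / link_bytes_per_s

face_bytes = halo_bytes_per_step(400)
face_seconds = transfer_seconds(face_bytes)

print(f"{face_bytes / 1e6:.2f} MB per face per step")
print(f"{face_seconds:.3f} s per face per step at the assumed link rate")
```

The real traffic depends on how the library actually decomposes the domain and which values it sends, so treat the result only as an indication of whether the network can matter at all.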

Also: does the algorithm scale linearly when run on Node1 alone with 8 cores? If it's 35.7615 with 4 cores, what result do you get with 8 cores on this machine?
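
For reference, the figures quoted above can be put into numbers with a small Python sketch. The three throughput values are the ones from the original post; the "balanced ideal" is an extra assumption of mine, namely that the work is split into 8 equal parts and every time step waits for the slowest process, so the slower node sets the pace.

```python
# Throughput figures quoted above, in mega site updates per second.
node1_np4 = 35.7615   # -np 4 on Node1
node2_np4 = 30.7972   # -np 4 on Node2
both_np8 = 47.3538    # -np 8 across both nodes

# Naive ideal: the two single-node results simply add up.
ideal_sum = node1_np4 + node2_np4
efficiency_vs_sum = both_np8 / ideal_sum

# ASSUMPTION: equal work per process and lock-step time stepping, so all
# 8 processes run at the per-process rate of the slower node.
slowest_per_process = min(node1_np4, node2_np4) / 4
ideal_balanced = 8 * slowest_per_process
efficiency_vs_balanced = both_np8 / ideal_balanced

# Gain from adding Node2 compared with Node1 alone.
speedup_vs_node1 = both_np8 / node1_np4

print(f"ideal (sum of nodes):      {ideal_sum:.4f}")
print(f"efficiency vs. sum:        {efficiency_vs_sum:.3f}")
print(f"ideal (slowest-rank bound): {ideal_balanced:.4f}")
print(f"efficiency vs. that bound: {efficiency_vs_balanced:.3f}")
print(f"speedup over Node1 alone:  {speedup_vs_node1:.3f}")
```

Running the same calculation with the 8-core result from Node1 alone would show whether the loss comes from the network or is already present inside one machine.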

-- Reuti