Hi guys
I'm benchmarking our (well tested) parallel code on and AMD based system, featuring 2x AMD Opteron(TM) Processor 6276, with 16 cores each for a total of 32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.
When I run a single core job the performance is as expected. However, when I run with 32 processes the performance drops to about 60% (when compared with other systems running the exact same problem, so this is not a code scaling issue). I think this may have to do with core binding / NUMA, but I haven't been able to get any improvement out of the bind-* mpirun options.
Any suggestions?
Thanks in advance,
Ricardo
P.S: Here's the output of lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
CPU socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 21
Model: 1
Stepping: 2
CPU MHz: 2300.045
BogoMIPS: 4599.38
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14
NUMA node1 CPU(s): 16,18,20,22,24,26,28,30
NUMA node2 CPU(s): 1,3,5,7,9,11,13,15
NUMA node3 CPU(s): 17,19,21,23,25,27,29,31
---
Ricardo Fonseca
Associate Professor
GoLP - Grupo de Lasers e Plasmas
Instituto de Plasmas e Fusão Nuclear
Instituto Superior Técnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal
tel: +351 21 8419202
fax: +351 21 8464455
web: http://golp.ist.utl.pt/
|