Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Help with multicore AMD machine performance
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-03-30 08:30:57


FWIW: 1.5.5 still doesn't support binding to NUMA regions, for example - and the script doesn't really do anything more than bind to cores. I believe only the trunk provides a more comprehensive set of binding options.

Given the described NUMA layout, I suspect bind-to-NUMA is going to make the biggest difference.
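
As a rough, untested sketch (assuming numactl is installed and using the OMPI_COMM_WORLD_LOCAL_RANK variable that mpirun sets; the wrapper name numa_wrap.sh is just an example), you could approximate bind-to-NUMA with a variant of Pavel's script below that pins each rank to a whole NUMA node from your lscpu output rather than to a single core:

#!/bin/bash
# Local rank of this process on the host, set by Open MPI's mpirun.
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

# Spread ranks across the 4 NUMA nodes reported by lscpu.
node=$(( rank % 4 ))

# Bind to all CPUs of that NUMA node and keep memory allocations on it.
exec numactl --cpunodebind=$node --membind=$node ./YOUR_PROG

Then run 'mpirun -np 32 ./numa_wrap.sh' instead of launching the application directly. If your release supports it, mpirun's --report-bindings option is also handy for checking what binding (if any) is actually being applied.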

On Mar 30, 2012, at 6:17 AM, Pavel Mezentsev wrote:

> You can try running using this script:
> #!/bin/bash
>
> # Local rank of this process on the node, set by Open MPI's mpirun.
> s=$(($OMPI_COMM_WORLD_NODE_RANK))
>
> # Pin the process to that physical CPU and allocate its memory locally.
> numactl --physcpubind=$((s)) --localalloc ./YOUR_PROG
>
> Instead of 'mpirun ... ./YOUR_PROG', run 'mpirun ... ./SCRIPT'.
>
> I tried this with openmpi-1.5.4 and it helped.
>
> Best regards, Pavel Mezentsev
>
> P.S. openmpi-1.5.5 binds processes correctly, so you can try it as well.
>
> 2012/3/30 Ralph Castain <rhc_at_[hidden]>
> I think you'd have much better luck using the developer's trunk as the binding there is much better - e.g., you can bind to NUMA instead of just cores. The 1.4 binding is pretty limited.
>
> http://www.open-mpi.org/nightly/trunk/
>
> On Mar 30, 2012, at 5:02 AM, Ricardo Fonseca wrote:
>
> > Hi guys
> >
> > I'm benchmarking our (well tested) parallel code on an AMD-based system featuring 2x AMD Opteron(TM) Processor 6276 CPUs, with 16 cores each for a total of 32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.
> >
> > When I run a single core job the performance is as expected. However, when I run with 32 processes the performance drops to about 60% (when compared with other systems running the exact same problem, so this is not a code scaling issue). I think this may have to do with core binding / NUMA, but I haven't been able to get any improvement out of the bind-* mpirun options.
> >
> > Any suggestions?
> >
> > Thanks in advance,
> > Ricardo
> >
> > P.S: Here's the output of lscpu
> >
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Byte Order: Little Endian
> > CPU(s): 32
> > On-line CPU(s) list: 0-31
> > Thread(s) per core: 2
> > Core(s) per socket: 8
> > CPU socket(s): 2
> > NUMA node(s): 4
> > Vendor ID: AuthenticAMD
> > CPU family: 21
> > Model: 1
> > Stepping: 2
> > CPU MHz: 2300.045
> > BogoMIPS: 4599.38
> > Virtualization: AMD-V
> > L1d cache: 16K
> > L1i cache: 64K
> > L2 cache: 2048K
> > L3 cache: 6144K
> > NUMA node0 CPU(s): 0,2,4,6,8,10,12,14
> > NUMA node1 CPU(s): 16,18,20,22,24,26,28,30
> > NUMA node2 CPU(s): 1,3,5,7,9,11,13,15
> > NUMA node3 CPU(s): 17,19,21,23,25,27,29,31
> >
> > ---
> > Ricardo Fonseca
> >
> > Associate Professor
> > GoLP - Grupo de Lasers e Plasmas
> > Instituto de Plasmas e Fusão Nuclear
> > Instituto Superior Técnico
> > Av. Rovisco Pais
> > 1049-001 Lisboa
> > Portugal
> >
> > tel: +351 21 8419202
> > fax: +351 21 8464455
> > web: http://golp.ist.utl.pt/
> >
> >