
Subject: Re: [OMPI devel] poor btl sm latency
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-02-15 13:29:38


Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte HRT ping pong. What is this all2all benchmark, btw? Is it measuring an MPI_ALLTOALL, or a pingpong?

FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about 0.27us latencies for short messages over the sm BTL when binding to socket.
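
For reference, this sort of 0-byte half-round-trip latency is usually measured
with something like the minimal sketch below (an illustration only -- not your
all2all code; the iteration counts are arbitrary). If a plain ping-pong like
this also reports ~1.4us over sm, the benchmark itself is probably not the
problem.

/* Minimal 0-byte half-round-trip ping-pong sketch.
 * Build: mpicc -O3 pingpong.c -o pingpong
 * Run:   mpirun -np 2 --bind-to-core ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, warmup = 1000, iters = 100000;
    double t0 = 0.0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < warmup + iters; i++) {
        if (i == warmup) {              /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* half of the round-trip time is the one-way (HRT) latency */
    if (rank == 0)
        printf("latency: %.3f us\n", (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}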

On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:

> I've built Open MPI 1.5.5rc1 (the tarball from the web) with CFLAGS=-O3.
> Unfortunately, that also had no effect.
>
> Here are some results with binding reports enabled:
>
> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to
> cpus 0002
> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to
> cpus 0001
> latency: 1.415us
>
> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2
> ./all2all_ompi1.5.5
> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to
> cpus 0002
> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to
> cpus 0001
> latency: 1.4us
>
> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2
> ./all2all_ompi1.5.5
> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to
> cpus 0002
> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to
> cpus 0001
> latency: 1.4us
>
>
> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to
> socket 0 cpus 0001
> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to
> socket 0 cpus 0001
> latency: 4.0us
>
> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2
> ./all2all_ompi1.5.5
> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to
> socket 0 cpus 0001
> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to
> socket 0 cpus 0001
> latency: 4.0us
>
> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2
> ./all2all_ompi1.5.5
> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to
> socket 0 cpus 0001
> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to
> socket 0 cpus 0001
> latency: 4.0us
>
>
> When socket binding is enabled, it seems that all ranks are bound to the very
> first core of one and the same socket. Is that intended? I expected each rank
> to get its own socket (i.e. 2 ranks -> 2 sockets)...
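>
> To double-check from inside the application where each rank really ends up,
> a small Linux-only sketch using sched_getaffinity() could be used
> (illustration only, independent of the --report-bindings output above):
>
> /* Each rank prints the cpus it is actually allowed to run on. */
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     cpu_set_t set;
>     char buf[4096] = "";
>     int rank, cpu, off = 0;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     if (sched_getaffinity(0, sizeof(set), &set) == 0) {
>         for (cpu = 0; cpu < CPU_SETSIZE && off < (int)sizeof(buf) - 16; cpu++)
>             if (CPU_ISSET(cpu, &set))
>                 off += snprintf(buf + off, sizeof(buf) - off, "%d ", cpu);
>         printf("rank %d allowed cpus: %s\n", rank, buf);
>     }
>
>     MPI_Finalize();
>     return 0;
> }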
>
> Matthias
>
> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
>> Also, double check that you have an optimized build, not a debugging build.
>>
>> SVN and HG checkouts default to debugging builds, which add in lots of
>> latency.
>>
>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
>>> Few thoughts
>>>
>>> 1. Bind to socket is broken in 1.5.4 - fixed in next release
>>>
>>> 2. Add --report-bindings to cmd line and see where it thinks the procs
>>> are bound
>>>
>>> 3. Sounds like memory may not be local - might be worth checking mem
>>> binding.
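>>>
>>> For point 3, one way to check where a buffer actually lives is a rough
>>> Linux/libnuma sketch like the following (link with -lnuma; illustration
>>> only):
>>>
>>> /* Report which NUMA node a freshly touched buffer was placed on. */
>>> #include <numaif.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <string.h>
>>>
>>> int main(void)
>>> {
>>>     size_t size = 4 * 1024 * 1024;
>>>     char *buf = malloc(size);
>>>     int node = -1;
>>>
>>>     memset(buf, 0, size);   /* touch the pages so they get placed */
>>>
>>>     /* MPOL_F_NODE | MPOL_F_ADDR returns the node holding the page at buf */
>>>     if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0)
>>>         printf("buffer resides on NUMA node %d\n", node);
>>>
>>>     free(buf);
>>>     return 0;
>>> }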
>>>
>>> Sent from my iPad
>>>
>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz_at_tu-
> dresden.de> wrote:
>>>> Hi Sylvain,
>>>>
>>>> thanks for the quick response!
>>>>
>>>> Here are some results with process binding enabled. I hope I used the
>>>> parameters correctly...
>>>>
>>>> bind two ranks to one socket:
>>>> $ mpirun -np 2 --bind-to-core ./all2all
>>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
>>>>
>>>> bind two ranks to two different sockets:
>>>> $ mpirun -np 2 --bind-to-socket ./all2all
>>>>
>>>> All three runs resulted in similarly bad latencies (~1.4us).
>>>>
>>>> :-(
>>>>
>>>> Matthias
>>>>
>>>> On Monday 13 February 2012 12:43:22 sylvain.jeaugey_at_[hidden] wrote:
>>>>> Hi Matthias,
>>>>>
>>>>> You might want to play with process binding to see if your problem is
>>>>> related to bad memory affinity.
>>>>>
>>>>> Try to launch pingpong on two CPUs of the same socket, then on
>>>>> different sockets (i.e. bind each process to a core, and try different
>>>>> configurations).
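>>>>>
>>>>> For explicit placement independent of mpirun's binding options, each
>>>>> process can also bind itself, e.g. with the hwloc 1.x C API -- a rough
>>>>> sketch (illustration only; the core index is passed on the command line):
>>>>>
>>>>> /* Bind the calling process to the given core using hwloc. */
>>>>> #include <hwloc.h>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     unsigned idx = (argc > 1) ? (unsigned)atoi(argv[1]) : 0;
>>>>>     hwloc_topology_t topo;
>>>>>     hwloc_obj_t core;
>>>>>
>>>>>     hwloc_topology_init(&topo);
>>>>>     hwloc_topology_load(topo);
>>>>>
>>>>>     core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, idx);
>>>>>     if (core && hwloc_set_cpubind(topo, core->cpuset,
>>>>>                                   HWLOC_CPUBIND_PROCESS) == 0)
>>>>>         printf("bound to core %u\n", idx);
>>>>>     else
>>>>>         printf("could not bind to core %u\n", idx);
>>>>>
>>>>>     hwloc_topology_destroy(topo);
>>>>>     return 0;
>>>>> }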
>>>>>
>>>>> Sylvain
>>>>>
>>>>>
>>>>>
>>>>> From: Matthias Jurenz <matthias.jurenz_at_[hidden]>
>>>>> To: Open MPI Developers <devel_at_[hidden]>
>>>>> Date: 13/02/2012 12:12
>>>>> Subject: [OMPI devel] poor btl sm latency
>>>>> Sent by: devel-bounces_at_[hidden]
>>>>>
>>>>>
>>>>>
>>>>> Hello all,
>>>>>
>>>>> On our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad
>>>>> latencies (~1.5us) when performing 0-byte p2p communication on a single
>>>>> node using the Open MPI sm BTL. When using Platform MPI we get ~0.5us
>>>>> latencies, which is pretty good. The bandwidth results are similar for
>>>>> both MPI implementations (~3.3 GB/s) - this is okay.
>>>>>
>>>>> One node has 64 cores and 64 GB RAM; it doesn't matter how many ranks
>>>>> the application allocates - we get similar results with different
>>>>> numbers of ranks.
>>>>>
>>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special
>>>>> configure options except the installation prefix and the location of
>>>>> the LSF installation.
>>>>>
>>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm, we tried to
>>>>> use /dev/shm instead of /tmp for the session directory, but it had no
>>>>> effect. Furthermore, we tried the current release candidate 1.5.5rc1 of
>>>>> Open MPI, which provides an option to use SysV shared memory
>>>>> (-mca shmem sysv) - this also results in similarly poor latencies.
>>>>>
>>>>> Do you have any idea? Please help!
>>>>>
>>>>> Thanks,
>>>>> Matthias

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/