Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] poor btl sm latency
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-02-16 07:33:16


Yowza. With inconsistent results like that, it does sound like something is going on in the hardware. Unfortunately, I don't know much/anything about AMDs (Cisco is an Intel shop). :-\

Do you have (AMD's equivalent of) hyperthreading enabled, perchance?
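If you want to double-check, on Linux something like the following should tell you (lscpu and the sysfs topology files are the usual places to look; the exact layout varies by kernel):

  $ lscpu | grep -i 'thread(s) per core'
  $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

If the second command lists more than one logical CPU for the core, some flavor of SMT is enabled.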

In the latest 1.5.5 nightly tarball, I have just upgraded the included version of hwloc to 1.3.2. A good next step would be to download hwloc 1.3.2 standalone and verify that lstopo faithfully reports the actual topology of your system. Can you do that?
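If it helps, the usual standalone sequence is roughly this (the install prefix is arbitrary; grab the 1.3.2 tarball from the hwloc web page first):

  $ tar xzf hwloc-1.3.2.tar.gz && cd hwloc-1.3.2
  $ ./configure --prefix=$HOME/sw/hwloc-1.3.2 && make && make install
  $ $HOME/sw/hwloc-1.3.2/bin/lstopo

Then compare lstopo's socket/core layout with what you expect for the Opteron 6274.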

On Feb 16, 2012, at 7:06 AM, Matthias Jurenz wrote:

> Jeff,
>
> sorry for the confusion - the all2all benchmark is a classic ping-pong which
> uses MPI_Send/Recv with 0-byte messages.
>
> One thing I just noticed when using NetPIPE/MPI: Platform MPI gives almost
> constant latencies for small messages (~0.89us) - I don't know how Platform
> MPI handles process binding, I just used the defaults.
> When using Open MPI (regardless of core/socket binding) the results differ
> from run to run:
>
> === FIRST RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
> 0: 1 bytes 100000 times --> 4.66 Mbps in 1.64 usec
> 1: 2 bytes 100000 times --> 8.94 Mbps in 1.71 usec
> 2: 3 bytes 100000 times --> 13.65 Mbps in 1.68 usec
> 3: 4 bytes 100000 times --> 17.91 Mbps in 1.70 usec
> 4: 6 bytes 100000 times --> 29.04 Mbps in 1.58 usec
> 5: 8 bytes 100000 times --> 39.06 Mbps in 1.56 usec
> 6: 12 bytes 100000 times --> 57.58 Mbps in 1.59 usec
>
> === SECOND RUN (~3s after the previous run) ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
> 0: 1 bytes 100000 times --> 5.73 Mbps in 1.33 usec
> 1: 2 bytes 100000 times --> 11.45 Mbps in 1.33 usec
> 2: 3 bytes 100000 times --> 17.13 Mbps in 1.34 usec
> 3: 4 bytes 100000 times --> 22.94 Mbps in 1.33 usec
> 4: 6 bytes 100000 times --> 34.39 Mbps in 1.33 usec
> 5: 8 bytes 100000 times --> 46.40 Mbps in 1.32 usec
> 6: 12 bytes 100000 times --> 68.92 Mbps in 1.33 usec
>
> === THIRD RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 0: n029
> Using synchronous sends
> 1: n029
> Now starting the main loop
> 0: 1 bytes 100000 times --> 3.50 Mbps in 2.18 usec
> 1: 2 bytes 100000 times --> 6.99 Mbps in 2.18 usec
> 2: 3 bytes 100000 times --> 10.48 Mbps in 2.18 usec
> 3: 4 bytes 100000 times --> 14.00 Mbps in 2.18 usec
> 4: 6 bytes 100000 times --> 20.98 Mbps in 2.18 usec
> 5: 8 bytes 100000 times --> 27.84 Mbps in 2.19 usec
> 6: 12 bytes 100000 times --> 41.99 Mbps in 2.18 usec
>
> At first I assumed that some CPU power-saving feature was enabled, but the
> frequency scaling governor is set to "performance" and there is only one
> available frequency (2.2GHz).
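> (Checked roughly via the cpufreq sysfs interface - exact paths may vary with
> the kernel:
>   $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>   $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
> )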
>
> Any idea how this can happen?
>
>
> Matthias
>
> On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
>> Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte
>> HRT ping pong. What is this all2all benchmark, btw? Is it measuring an
>> MPI_ALLTOALL, or a pingpong?
>>
>> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
>> .27us latencies for short messages over sm and binding to socket.
>>
>> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
>>> I've built Open MPI 1.5.5rc1 (tarball from the web) with CFLAGS=-O3.
>>> Unfortunately, that also had no effect.
>>>
>>> Here are some results with binding reports enabled:
>>>
>>> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1]
>>> to cpus 0002
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0]
>>> to cpus 0001
>>> latency: 1.415us
>>>
>>> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2
>>> ./all2all_ompi1.5.5
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1]
>>> to cpus 0002
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0]
>>> to cpus 0001
>>> latency: 1.4us
>>>
>>> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np
>>> 2 ./all2all_ompi1.5.5
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1]
>>> to cpus 0002
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0]
>>> to cpus 0001
>>> latency: 1.4us
>>>
>>>
>>> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1]
>>> to socket 0 cpus 0001
>>> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0]
>>> to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2
>>> ./all2all_ompi1.5.5
>>> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1]
>>> to socket 0 cpus 0001
>>> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0]
>>> to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings
>>> -np 2 ./all2all_ompi1.5.5
>>> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1]
>>> to socket 0 cpus 0001
>>> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0]
>>> to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>>
>>> If socket binding is enabled, it seems that all ranks are bound to the very
>>> first core of one and the same socket. Is that intended? I expected that
>>> each rank would get its own socket (i.e. 2 ranks -> 2 sockets)...
>>>
>>> Matthias
>>>
>>> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
>>>> Also, double check that you have an optimized build, not a debugging
>>>> build.
>>>>
>>>> SVN and HG checkouts default to debugging builds, which add in lots of
>>>> latency.
>>>>
>>>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
>>>>> A few thoughts:
>>>>>
>>>>> 1. Bind to socket is broken in 1.5.4 - fixed in next release
>>>>>
>>>>> 2. Add --report-bindings to cmd line and see where it thinks the procs
>>>>> are bound
>>>>>
>>>>> 3. Sounds like memory may not be local - might be worth checking memory
>>>>> binding.
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz_at_tu-dresden.de> wrote:
>>>>>> Hi Sylvain,
>>>>>>
>>>>>> thanks for the quick response!
>>>>>>
>>>>>> Here are some results with process binding enabled. I hope I used the
>>>>>> parameters correctly...
>>>>>>
>>>>>> bind two ranks to one socket:
>>>>>> $ mpirun -np 2 --bind-to-core ./all2all
>>>>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
>>>>>>
>>>>>> bind two ranks to two different sockets:
>>>>>> $ mpirun -np 2 --bind-to-socket ./all2all
>>>>>>
>>>>>> All three runs resulted in similar bad latencies (~1.4us).
>>>>>>
>>>>>> :-(
>>>>>>
>>>>>> Matthias
>>>>>>
>>>>>> On Monday 13 February 2012 12:43:22 sylvain.jeaugey_at_[hidden] wrote:
>>>>>>> Hi Matthias,
>>>>>>>
>>>>>>> You might want to play with process binding to see if your problem is
>>>>>>> related to bad memory affinity.
>>>>>>>
>>>>>>> Try launching a ping-pong on two CPUs of the same socket, then on
>>>>>>> different sockets (i.e. bind each process to a core and try
>>>>>>> different configurations).
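>>>>>>> For example, with a rankfile (if I recall the syntax correctly; the
>>>>>>> hostname and socket:core numbers below are just placeholders):
>>>>>>>   $ cat rankfile
>>>>>>>   rank 0=<hostname> slot=0:0
>>>>>>>   rank 1=<hostname> slot=0:1
>>>>>>>   $ mpirun -np 2 -rf rankfile ./all2all
>>>>>>> Then vary the slots: same socket, different sockets, cores sharing a
>>>>>>> cache, etc.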
>>>>>>>
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Matthias Jurenz <matthias.jurenz_at_[hidden]>
>>>>>>> To: Open MPI Developers <devel_at_[hidden]>
>>>>>>> Date: 13/02/2012 12:12
>>>>>>> Subject: [OMPI devel] poor btl sm latency
>>>>>>> Sent by: devel-bounces_at_[hidden]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> on our new AMD cluster (AMD Opteron 6274, 2.2GHz) we get very bad
>>>>>>> latencies (~1.5us) when performing 0-byte p2p communication on a single
>>>>>>> node using the Open MPI sm BTL. With Platform MPI we get ~0.5us
>>>>>>> latencies, which is pretty good. The bandwidth results are similar for
>>>>>>> both MPI implementations (~3.3GB/s) - this is okay.
>>>>>>>
>>>>>>> One node has 64 cores and 64GB RAM; it doesn't seem to matter how many
>>>>>>> ranks are allocated by the application - we get similar results with
>>>>>>> different numbers of ranks.
>>>>>>>
>>>>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 and no special
>>>>>>> configure options except the installation prefix and the location of
>>>>>>> the LSF installation.
>>>>>>>
>>>>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm, we tried
>>>>>>> using /dev/shm instead of /tmp for the session directory, but it had no
>>>>>>> effect. Furthermore, we tried the current release candidate 1.5.5rc1 of
>>>>>>> Open MPI, which provides an option to use SysV shared memory
>>>>>>> (-mca shmem sysv) - this also results in similarly poor latencies.
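>>>>>>> (For the session directory we pointed the tmpdir MCA parameter at
>>>>>>> /dev/shm, roughly like this - assuming I remember the parameter name
>>>>>>> correctly:
>>>>>>>   $ mpirun -mca orte_tmpdir_base /dev/shm -np 2 ./all2all
>>>>>>> )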
>>>>>>>
>>>>>>> Do you have any idea? Please help!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthias
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/