
Subject: Re: [OMPI devel] poor btl sm latency
From: Matthias Jurenz (matthias.jurenz_at_[hidden])
Date: 2012-03-15 11:06:31


We took a big step forward today!

The kernel we are using has a bug related to the shared L1 instruction cache in
AMD Bulldozer processors:
See
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
and
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf

Until the kernel is patched, we disable address-space layout randomization
(ASLR) as described in the PDF above:

   $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
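
To double-check that the setting took effect, and to keep it across reboots
(assuming the usual /etc/sysctl.conf location), something like this should work:

   $ cat /proc/sys/kernel/randomize_va_space   # prints 0 when ASLR is disabled
   $ echo "kernel.randomize_va_space = 0" | sudo tee -a /etc/sysctl.conf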

With that change, NetPIPE shows ~0.5us latency when binding the processes for
L2/L1I cache sharing (i.e. -bind-to-core).

However, when binding the processes to exclusive L2/L1I caches (i.e. -cpus-
per-proc 2) we still get ~1.1us latency. I don't think the upcoming kernel
patch will help with this kind of process binding...
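
For completeness, the two binding variants correspond roughly to launch lines
like the following (NetPIPE arguments as in the runs further down; the exact
option spelling may differ between Open MPI versions):

   # shared L2/L1I (two neighboring cores of one Bulldozer module), ~0.5us:
   $ mpirun -np 2 -mca btl sm,self -bind-to-core ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0

   # exclusive L2/L1I (one core per module), still ~1.1us:
   $ mpirun -np 2 -mca btl sm,self -cpus-per-proc 2 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0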

Matthias
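
PS: If anyone wants to repeat the btl_sm_fifo_lazy_free experiment from
further down in this thread, passing the parameter on the mpirun command line
should be sufficient, e.g.:

   $ mpirun -np 2 -mca btl sm,self -mca btl_sm_fifo_lazy_free 1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0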

On Monday 12 March 2012 11:09:01 Matthias Jurenz wrote:
> It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
> 2.6.32.49-0.3-default.
>
> Matthias
>
> On Friday 09 March 2012 16:36:41 you wrote:
> > What OS are you using ?
> >
> > Joshua
> >
> > ----- Original Message -----
> > From: Matthias Jurenz [mailto:matthias.jurenz_at_[hidden]]
> > Sent: Friday, March 09, 2012 08:50 AM
> > To: Open MPI Developers <devel_at_[hidden]>
> > Cc: Mora, Joshua
> > Subject: Re: [OMPI devel] poor btl sm latency
> >
> > I just made an interesting observation:
> >
> > When binding the processes to two neighboring cores (L2 sharing) NetPIPE
> > shows *sometimes* pretty good results: ~0.5us
> >
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u
> > 4 -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n
> > 100000 -p 0
> > using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> > using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> > adding 0x00000001 to 0x0
> > adding 0x00000001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000001
> > adding 0x00000002 to 0x0
> > adding 0x00000002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000002
> > Using no perturbations
> >
> > 0: n035
> > Using no perturbations
> >
> > 1: n035
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec
> > 1: 2 bytes 100000 times --> 12.04 Mbps in 1.27 usec
> > 2: 3 bytes 100000 times --> 18.07 Mbps in 1.27 usec
> > 3: 4 bytes 100000 times --> 24.13 Mbps in 1.26 usec
> >
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u
> > 4 -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n
> > 100000 -p 0
> > using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> > adding 0x00000001 to 0x0
> > adding 0x00000001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000001
> > using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> > adding 0x00000002 to 0x0
> > adding 0x00000002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000002
> > Using no perturbations
> >
> > 0: n035
> > Using no perturbations
> >
> > 1: n035
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 12.96 Mbps in 0.59 usec
> > 1: 2 bytes 100000 times --> 25.78 Mbps in 0.59 usec
> > 2: 3 bytes 100000 times --> 38.62 Mbps in 0.59 usec
> > 3: 4 bytes 100000 times --> 52.88 Mbps in 0.58 usec
> >
> > I can reproduce that approximately every tenth run.
> >
> > When binding the processes to exclusive L2 caches (e.g. cores 0 and 2) I
> > get constant latencies of ~1.1us.
> >
> > Matthias
> >
> > On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > > Here the SM BTL parameters:
> > >
> > > $ ompi_info --param btl sm
> > > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > > default value) Verbosity level of the BTL framework
> > > MCA btl: parameter "btl" (current value: <self,sm,openib>, data source:
> > > file
> > > [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> > > Default selection set of components for the btl framework (<none>
> > > means use all components that can be found)
> > > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data
> > > source: default value) Whether this component supports the knem Linux
> > > kernel module or not
> > > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > > default value) Whether knem support is desired or not (negative = try
> > > to enable knem support, but continue even if it is not available, 0 =
> > > do not enable knem support, positive = try to enable knem support and
> > > fail if it is not available)
> > > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data
> > > source: default value) Minimum message size (in bytes) to use the knem
> > > DMA mode; ignored if knem does not support DMA mode (0 = do not use the
> > > knem DMA mode)
> > > MCA btl: parameter "btl_sm_knem_max_simultaneous"
> > > (current value: <0>, data source: default value) Max number of
> > > simultaneous ongoing knem operations to support (0 = do everything
> > > synchronously, which probably gives the best large message latency; >0
> > > means to do all operations asynchronously, which supports better
> > > overlap for simultaneous large message sends)
> > > MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_mpool" (current value: <sm>, data source:
> > > default value)
> > > MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> > > default value)
> > > MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> > > source: default value) BTL exclusivity (must be >= 0)
> > > MCA btl: parameter "btl_sm_flags" (current value: <5>, data source:
> > > default value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
> > > SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only
> > > used by the "dr" PML (ignored by others): ACK=16, CHECKSUM=32,
> > > RDMA_COMPLETION=128; flags only used by the "bfo" PML (ignored by
> > > others): FAILOVER_SUPPORT=512)
> > > MCA btl: parameter
> > > "btl_sm_rndv_eager_limit" (current value: <4096>, data source: default
> > > value) Size (in bytes) of "phase 1" fragment sent for all large
> > > messages (must be >= 0 and <= eager_limit)
> > > MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data
> > > source: default value) Maximum size (in bytes) of "short" messages
> > > (must be >= 1).
> > > MCA btl: parameter "btl_sm_max_send_size" (current
> > > value: <32768>, data source: default value) Maximum size (in bytes) of
> > > a single "phase 2" fragment of a long message when using the pipeline
> > > protocol (must be >= 1)
> > > MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data
> > > source: default value) Approximate maximum bandwidth of interconnect (0
> > > = auto-detect value at run-time [not supported in all BTL modules], >=
> > > 1 = bandwidth in Mbps)
> > > MCA btl: parameter "btl_sm_latency" (current value: <1>, data source:
> > > default value) Approximate latency of interconnect (must be >= 0)
> > > MCA btl: parameter "btl_sm_priority" (current value: <0>, data source:
> > > default value)
> > > MCA btl: parameter "btl_base_warn_component_unused" (current value:
> > > <1>, data source: default value) This parameter is used to turn on
> > > warning messages when certain NICs are not used
> > >
> > > Matthias
> > >
> > > On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> > > > Please run "ompi_info --param btl sm" in your environment. The
> > > > lazy_free parameter directs the internals of the SM BTL not to release
> > > > the memory fragments used to communicate until the lazy limit is
> > > > reached. The default value was deemed reasonable a while back
> > > > when the number of default fragments was large. Lately there were
> > > > some patches to reduce the memory footprint of the SM BTL and these
> > > > might have lowered the available fragments to a limit where the
> > > > default value for the lazy_free is now too large.
> > > >
> > > > george.
> > > >
> > > > On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > > > > Thanks to the OTPO tool, I figured out that setting the MCA
> > > > > parameter btl_sm_fifo_lazy_free to 1 (default is 120) improves the
> > > > > latency significantly: 0.88µs
> > > > >
> > > > > But somehow I get the feeling that this doesn't eliminate the
> > > > > actual problem...
> > > > >
> > > > > Matthias
> > > > >
> > > > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> > > > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > > > >>> Ok. Good that there's no oversubscription bug, at least. :-)
> > > > >>>
> > > > >>> Did you see my off-list mail to you yesterday about building with
> > > > >>> an external copy of hwloc 1.4 to see if that helps?
> > > > >>
> > > > >> Yes, I did - I answered as well. Our mail server seems to be
> > > > >> somewhat busy today...
> > > > >>
> > > > >> Just for the record: Using hwloc-1.4 makes no difference.
> > > > >>
> > > > >> Matthias
> > > > >>
> > > > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > > > >>>> To rule out a possible bug in the LSF component, I rebuilt
> > > > >>>> Open MPI without support for LSF (--without-lsf).
> > > > >>>>
> > > > >>>> -> It makes no difference - the latency is still bad: ~1.1us.
> > > > >>>>
> > > > >>>> Matthias
> > > > >>>>
> > > > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > > > >>>>> SORRY, it was obviously a big mistake on my part. :-(
> > > > >>>>>
> > > > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an
> > > > >>>>> LSF job it's necessary to request at least as many
> > > > >>>>> tasks/cores as are used for the subsequent mpirun command. That
> > > > >>>>> was not the case - I forgot bsub's '-n' option to specify the
> > > > >>>>> number of tasks, so only *one* task/core was requested.
> > > > >>>>>
> > > > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > > > >>>>> misbehavior could not happen with it.
> > > > >>>>>
> > > > >>>>> In short, there is no bug in Open MPI 1.5.x regarding the
> > > > >>>>> detection of oversubscription. Sorry for any confusion!
> > > > >>>>>
> > > > >>>>> Matthias
> > > > >>>>>
> > > > >>>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > > > >>>>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same
> > > > >>>>>> result as I get with Open MPI v1.5.x using
> > > > >>>>>> mpi_yield_when_idle=0. So I think there is a bug in Open MPI
> > > > >>>>>> (v1.5.4 and v1.5.5rc2) regarding the automatic performance
> > > > >>>>>> mode selection.
> > > > >>>>>>
> > > > >>>>>> When enabling the degraded performance mode for Open MPI 1.4.5
> > > > >>>>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > > > >>>>>>
> > > > >>>>>> Matthias
> > > > >>>>>>
> > > > >>>>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > > > >>>>>>> On 13/02/12 22:11, Matthias Jurenz wrote:
> > > > >>>>>>>> Do you have any idea? Please help!
> > > > >>>>>>>
> > > > >>>>>>> Do you see the same bad latency in the old branch (1.4.5) ?
> > > > >>>>>>>
> > > > >>>>>>> cheers,
> > > > >>>>>>> Chris
> > > > >>>>>>