Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] poor btl sm latency
From: Matthias Jurenz (matthias.jurenz_at_[hidden])
Date: 2012-03-12 06:09:01


It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
2.6.32.49-0.3-default.

Matthias

On Friday 09 March 2012 16:36:41 you wrote:
> What OS are you using?
>
> Joshua
>
> ----- Original Message -----
> From: Matthias Jurenz [mailto:matthias.jurenz_at_[hidden]]
> Sent: Friday, March 09, 2012 08:50 AM
> To: Open MPI Developers <devel_at_[hidden]>
> Cc: Mora, Joshua
> Subject: Re: [OMPI devel] poor btl sm latency
>
> I just made an interesting observation:
>
> When binding the processes to two neighboring cores (L2 sharing), NetPIPE
> *sometimes* shows pretty good results: ~0.5us
>
> $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 \
>     -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 \
>     -n 100000 -p 0
> using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> adding 0x00000001 to 0x0
> adding 0x00000001 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000001
> adding 0x00000002 to 0x0
> adding 0x00000002 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000002
> Using no perturbations
>
> 0: n035
> Using no perturbations
>
> 1: n035
> Now starting the main loop
> 0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec
> 1: 2 bytes 100000 times --> 12.04 Mbps in 1.27 usec
> 2: 3 bytes 100000 times --> 18.07 Mbps in 1.27 usec
> 3: 4 bytes 100000 times --> 24.13 Mbps in 1.26 usec
>
> $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 \
>     -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 \
>     -n 100000 -p 0
> using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> adding 0x00000001 to 0x0
> adding 0x00000001 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000001
> using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> adding 0x00000002 to 0x0
> adding 0x00000002 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000002
> Using no perturbations
>
> 0: n035
> Using no perturbations
>
> 1: n035
> Now starting the main loop
> 0: 1 bytes 100000 times --> 12.96 Mbps in 0.59 usec
> 1: 2 bytes 100000 times --> 25.78 Mbps in 0.59 usec
> 2: 3 bytes 100000 times --> 38.62 Mbps in 0.59 usec
> 3: 4 bytes 100000 times --> 52.88 Mbps in 0.58 usec
>
> I can reproduce that approximately every tenth run.
>
> When binding the processes to cores with exclusive L2 caches (e.g. cores 0
> and 2), I get constant latencies of ~1.1us.
>
> Matthias
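
To compare many such runs without reading each one by eye, the NetPIPE result lines can be parsed and each run labeled by its mean latency. The following is a minimal sketch, not part of the original thread: the function names and the 1.0 usec threshold are assumptions chosen only to separate the ~0.59 usec and ~1.27 usec cases shown above.

```python
import re

# Matches NetPIPE result lines such as
#   "0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec"
# The pattern is searched for anywhere in the line, so leading email
# quote markers ("> ") do not get in the way.
LINE_RE = re.compile(
    r"(\d+):\s+(\d+)\s+bytes\s+(\d+)\s+times\s+-->\s+"
    r"([\d.]+)\s+Mbps\s+in\s+([\d.]+)\s+usec"
)


def parse_netpipe(text):
    """Return a list of (message_bytes, latency_usec) tuples found in text."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            results.append((int(match.group(2)), float(match.group(5))))
    return results


def classify_run(text, threshold_usec=1.0):
    """Label a run 'fast' or 'slow' by its mean latency.

    threshold_usec is a hypothetical cut-off, not a value from the thread.
    """
    latencies = [latency for _, latency in parse_netpipe(text)]
    if not latencies:
        raise ValueError("no NetPIPE result lines found")
    mean = sum(latencies) / len(latencies)
    return ("fast" if mean < threshold_usec else "slow"), mean


# Result lines copied from the two runs quoted above.
SLOW_RUN = """\
> 0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec
> 1: 2 bytes 100000 times --> 12.04 Mbps in 1.27 usec
> 2: 3 bytes 100000 times --> 18.07 Mbps in 1.27 usec
> 3: 4 bytes 100000 times --> 24.13 Mbps in 1.26 usec
"""

FAST_RUN = """\
> 0: 1 bytes 100000 times --> 12.96 Mbps in 0.59 usec
> 1: 2 bytes 100000 times --> 25.78 Mbps in 0.59 usec
> 2: 3 bytes 100000 times --> 38.62 Mbps in 0.59 usec
> 3: 4 bytes 100000 times --> 52.88 Mbps in 0.58 usec
"""

if __name__ == "__main__":
    for name, run in (("run 1", SLOW_RUN), ("run 2", FAST_RUN)):
        label, mean = classify_run(run)
        print(f"{name}: {label} (mean latency {mean:.2f} usec)")
```

Running the benchmark in a loop and feeding each output to `classify_run` would give a count of fast versus slow runs, which is one way to check the "approximately every tenth run" estimate.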
>
> On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > Here are the SM BTL parameters:
> >
> > $ ompi_info --param btl sm
> > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > default value) Verbosity level of the BTL framework
> > MCA btl: parameter "btl" (current value: <self,sm,openib>, data source:
> > file [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> > Default selection set of components for the btl framework (<none> means
> > use all components that can be found)
> > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> > default value) Whether this component supports the knem Linux kernel
> > module or not
> > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > default value) Whether knem support is desired or not (negative = try to
> > enable knem support, but continue even if it is not available, 0 = do not
> > enable knem support, positive = try to enable knem support and fail if it
> > is not available)
> > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data
> > source: default value) Minimum message size (in bytes) to use the knem
> > DMA mode; ignored if knem does not support DMA mode (0 = do not use the
> > knem DMA mode)
> > MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: <0>,
> > data source: default value) Max number of simultaneous ongoing knem
> > operations to support (0 = do everything synchronously, which probably
> > gives the best large message latency; >0 means to do all operations
> > asynchronously, which supports better overlap for simultaneous large
> > message sends)
> > MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data
> > source: default value)
> > MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> > source: default value)
> > MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> > source: default value)
> > MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source:
> > default value)
> > MCA btl: parameter "btl_sm_mpool" (current value: <sm>, data source:
> > default value)
> > MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data
> > source: default value)
> > MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> > default value)
> > MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> > source: default value)
> > MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> > source: default value)
> > MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> > source: default value) BTL exclusivity (must be >= 0)
> > MCA btl: parameter "btl_sm_flags" (current value: <5>, data source:
> > default value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
> > SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used
> > by the "dr" PML (ignored by others): ACK=16, CHECKSUM=32,
> > RDMA_COMPLETION=128; flags only used by the "bfo" PML (ignored by
> > others): FAILOVER_SUPPORT=512)
> > MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>,
> > data source: default value) Size (in bytes) of "phase 1" fragment sent
> > for all large messages (must be >= 0 and <= eager_limit)
> > MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data
> > source: default value) Maximum size (in bytes) of "short" messages (must
> > be >= 1).
> > MCA btl: parameter "btl_sm_max_send_size" (current value: <32768>, data
> > source: default value) Maximum size (in bytes) of a single "phase 2"
> > fragment of a long message when using the pipeline protocol (must be
> > >= 1)
> > MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data
> > source: default value) Approximate maximum bandwidth of interconnect(0 =
> > auto-detect value at run-time [not supported in all BTL modules], >= 1 =
> > bandwidth in Mbps)
> > MCA btl: parameter "btl_sm_latency" (current value: <1>, data source:
> > default value) Approximate latency of interconnect (must be >= 0)
> > MCA btl: parameter "btl_sm_priority" (current value: <0>, data source:
> > default value)
> > MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>,
> > data source: default value) This parameter is used to turn on warning
> > messages when certain NICs are not used
> >
> > Matthias
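
Because the listing above is long and re-wrapped by the mail client, it can help to pull out just the parameter names and current values. The sketch below assumes the `parameter "<name>" (current value: <value>` layout seen in the quoted output; the function name is hypothetical, and the quote-marker stripping is a convenience for text pasted from an email rather than anything `ompi_info` itself requires.

```python
import re

# Layout assumed from the listing above:
#   MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, ...
PARAM_RE = re.compile(
    r'parameter\s+"([^"]+)"\s+\(current\s+value:\s+<([^>]*)>'
)


def parse_mca_params(text):
    """Return {parameter_name: current_value_string} from ompi_info-style text.

    Leading email quote markers and whitespace are stripped from every line,
    and the lines are joined so that entries wrapped across lines still match.
    Note that this stripping would also eat a ">" that starts a wrapped
    description line; only names and values are extracted, so that is harmless
    here.
    """
    unquoted = [re.sub(r"^[>\s]+", "", line) for line in text.splitlines()]
    joined = " ".join(unquoted)
    return {name: value for name, value in PARAM_RE.findall(joined)}


# A few entries copied from the listing above, including one that is
# split across lines in the archived email.
SAMPLE = """\
> > MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> > source: default value)
> > others): FAILOVER_SUPPORT=512) MCA btl: parameter
> > "btl_sm_rndv_eager_limit" (current value: <4096>, data source: default
> > value) Size (in bytes) of "phase 1" fragment sent for all large messages
> > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > default value)
"""

if __name__ == "__main__":
    for name, value in sorted(parse_mca_params(SAMPLE).items()):
        print(f"{name} = {value}")
```

Values are kept as strings because some parameters in the listing (for example `btl`) are component lists rather than integers.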
> >
> > On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> > > > Please run "ompi_info --param btl sm" in your environment. The
> > > > lazy_free parameter directs the internals of the SM BTL not to
> > > > release the memory fragments used to communicate until the lazy limit
> > > > is reached. The default value was deemed reasonable a while back, when
> > > > the number of default fragments was large. Lately there were some
> > > > patches to reduce the memory footprint of the SM BTL, and these might
> > > > have lowered the available fragments to a limit where the default
> > > > value for lazy_free is now too large.
> > >
> > > george.
> > >
> > > On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > > > > Thanks to the OTPO tool, I figured out that setting the MCA
> > > > > parameter btl_sm_fifo_lazy_free to 1 (default is 120) improves the
> > > > > latency significantly: 0.88µs
> > > >
> > > > But somehow I get the feeling that this doesn't eliminate the actual
> > > > problem...
> > > >
> > > > Matthias
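
For repeating this experiment with different values, the override can be passed on the command line using the same `-mca <name> <value>` form that appears elsewhere in this thread. The helper below is a hypothetical sketch that only assembles the argument list; it does not run anything, and the process count and benchmark path are placeholders taken from the commands quoted above.

```python
import shlex


def mpirun_command(benchmark, mca_params, np=2):
    """Build an mpirun argument list with one '-mca name value' pair per entry.

    benchmark:  path to the executable (placeholder, e.g. the NetPIPE binary).
    mca_params: mapping of MCA parameter names to values.
    np:         number of processes to start.
    """
    argv = ["mpirun"]
    for name, value in mca_params.items():
        argv += ["-mca", name, str(value)]
    argv += ["-np", str(np), benchmark]
    return argv


if __name__ == "__main__":
    cmd = mpirun_command(
        "./NPmpi_ompi1.5.5",
        {"btl": "sm,self", "btl_sm_fifo_lazy_free": 1},
    )
    # shlex.join (Python 3.8+) quotes arguments so the line can be pasted
    # into a shell.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run` on a machine where Open MPI is installed; sweeping `btl_sm_fifo_lazy_free` over several values this way is a manual version of what a tuning tool such as OTPO automates.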
> > > >
> > > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> > > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > > >>> Ok. Good that there's no oversubscription bug, at least. :-)
> > > >>>
> > > >>> Did you see my off-list mail to you yesterday about building with
> > > >>> an external copy of hwloc 1.4 to see if that helps?
> > > >>
> > > > >> Yes, I did - I answered as well. Our mail server seems to be
> > > > >> somewhat busy today...
> > > >>
> > > >> Just for the record: Using hwloc-1.4 makes no difference.
> > > >>
> > > >> Matthias
> > > >>
> > > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > > >>>> To exclude a possible bug within the LSF component, I rebuilt Open
> > > >>>> MPI without support for LSF (--without-lsf).
> > > >>>>
> > > >>>> -> It makes no difference - the latency is still bad: ~1.1us.
> > > >>>>
> > > >>>> Matthias
> > > >>>>
> > > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > > > >>>>> SORRY, it was obviously a big mistake on my part. :-(
> > > >>>>>
> > > > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an
> > > > >>>>> LSF job it's necessary to request at least as many tasks/cores
> > > > >>>>> as are used by the subsequent mpirun command. That was not the
> > > > >>>>> case - I forgot bsub's '-n' option to specify the number of
> > > > >>>>> tasks, so only *one* task/core was requested.
> > > >>>>>
> > > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > > >>>>> misbehavior could not happen with it.
> > > >>>>>
> > > > >>>>> In short, there is no bug in Open MPI 1.5.x regarding the
> > > > >>>>> detection of oversubscription. Sorry for any confusion!
> > > >>>>>
> > > >>>>> Matthias
> > > >>>>>
> > > >>>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > > >>>>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result
> > > >>>>>> as I get with Open MPI v1.5.x using mpi_yield_when_idle=0. So I
> > > > >>>>>> think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2)
> > > > >>>>>> regarding the automatic performance mode selection.
> > > >>>>>>
> > > >>>>>> When enabling the degraded performance mode for Open MPI 1.4.5
> > > >>>>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > > >>>>>>
> > > >>>>>> Matthias
> > > >>>>>>
> > > >>>>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > > >>>>>>> On 13/02/12 22:11, Matthias Jurenz wrote:
> > > >>>>>>>> Do you have any idea? Please help!
> > > >>>>>>>
> > > > >>>>>>> Do you see the same bad latency in the old branch (1.4.5)?
> > > >>>>>>>
> > > >>>>>>> cheers,
> > > >>>>>>> Chris
> > > >>>>>>
> > > >>>>>> _______________________________________________
> > > >>>>>> devel mailing list
> > > >>>>>> devel_at_[hidden]
> > > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > >>>>>
> > > >>>>> _______________________________________________
> > > >>>>> devel mailing list
> > > >>>>> devel_at_[hidden]
> > > >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > >>>>
> > > >>>> _______________________________________________
> > > >>>> devel mailing list
> > > >>>> devel_at_[hidden]
> > > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > >>
> > > >> _______________________________________________
> > > >> devel mailing list
> > > >> devel_at_[hidden]
> > > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > >
> > > > _______________________________________________
> > > > devel mailing list
> > > > devel_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel