Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] poor btl sm latency
From: Matthias Jurenz (matthias.jurenz_at_[hidden])
Date: 2012-02-20 07:46:54


If the processes are bound so that they share an L2 cache (i.e. using the
neighboring cores pu:0 and pu:1), I get the *worst* latency results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind
pu:1 ./NPmpi -S -u 4 -n 100000
Using synchronous sends
Using synchronous sends
0: n023
1: n023
Now starting the main loop
  0: 1 bytes 100000 times --> 3.54 Mbps in 2.16 usec
  1: 2 bytes 100000 times --> 7.10 Mbps in 2.15 usec
  2: 3 bytes 100000 times --> 10.68 Mbps in 2.14 usec
  3: 4 bytes 100000 times --> 14.23 Mbps in 2.15 usec

As expected, I get the same (worst-case) result when using '-bind-to-core'
*without* '--cpus-per-proc 2'.
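
For comparison, the same run using only Open MPI's binding options should look
roughly like the following (just a sketch - I'm assuming the option names here,
and '--report-bindings' only prints where each rank ends up):

$ # both ranks on neighboring cores, i.e. sharing an L2 on this machine
$ mpiexec -np 2 -bind-to-core --report-bindings ./NPmpi -S -u 4 -n 100000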

When using two separate L2s (pu:0 and pu:2, or '--cpus-per-proc 2') I get
better results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind
pu:2 ./NPmpi -S -u 4 -n 100000
Using synchronous sends
0: n023
Using synchronous sends
1: n023
Now starting the main loop
  0: 1 bytes 100000 times --> 5.15 Mbps in 1.48 usec
  1: 2 bytes 100000 times --> 10.15 Mbps in 1.50 usec
  2: 3 bytes 100000 times --> 15.26 Mbps in 1.50 usec
  3: 4 bytes 100000 times --> 20.23 Mbps in 1.51 usec
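
The equivalent placement via Open MPI's options should be something like this
sketch (again assuming the option names, untested in this exact form):

$ # one rank per dual-core module, i.e. separate L2s
$ mpiexec -np 2 -bind-to-core --cpus-per-proc 2 --report-bindings ./NPmpi -S -u 4 -n 100000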

So it seems that the process binding within Open MPI works correctly and can be
ruled out as the cause of the bad latency :-(
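
As a side note: to cross-check what the kernel itself reports about L2 sharing
(independent of hwloc and Open MPI), one can look at sysfs - just a sketch, the
cache index numbering may differ on other CPUs, and older kernels only provide
shared_cpu_map instead of shared_cpu_list:

$ # if the kernel knew about the shared L2, both lists should contain 0-1
$ cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
$ cat /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list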

Matthias

On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> On 16/02/2012 17:12, Matthias Jurenz wrote:
> > Thanks for the hint, Brice.
> > I'll forward this bug report to our cluster vendor.
> >
> > Could this be the reason for the bad latencies with Open MPI or does it
> > only affect hwloc/lstopo?
>
> It affects binding. So it may affect the performance you observed when
> using "high-level" binding policies that end up binding to the wrong cores
> because of hwloc/kernel problems. If you specify binding manually, it
> shouldn't hurt.
>
> If the best latency case is supposed to be when L2 is shared, then try:
> mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1
> ./all2all
> Then, we'll see if you can get the same result with one of OMPI binding
> options.
>
> Brice
>
> > Matthias
> >
> > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> >>> Here is the output of lstopo from a single compute node. I'm surprised
> >>> that the L1/L2 sharing isn't visible - not even in the
> >>> graphical output...
> >>
> >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> >> and L2 are shared across dual-core modules. If you have some contact at
> >> AMD, please tell them to look at
> >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> >>
> >> Brice
> >>