Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] Highly variable performance
From: Jed Brown (jed_at_[hidden])
Date: 2010-07-15 10:40:52


On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Per my other disclaimer, I'm trolling through my disastrous inbox and
> finding some orphaned / never-answered emails. Sorry for the delay!

No problem, I should have followed up on this with further explanation.

> Just to be clear -- you're running 8 procs locally on an 8 core node,
> right?

These are actually 4-socket quad-core nodes, so there are 16 cores
available, but we are only running on 8, -npersocket 2 -bind-to-socket.
This was a greatly simplified case, but is still sufficient to show the
variability. It tends to be somewhat worse if we use all cores of a
node.

  (Cisco is an Intel partner -- I don't follow the AMD line
> much) So this should all be local communication with no external
> network involved, right?

Yes, this was the greatly simplified case, contained entirely within a

> > lsf.o240562 killed 8*a6200
> > lsf.o240563 9.2110e+01 8*a6200
> > lsf.o240564 1.5638e+01 8*a6237
> > lsf.o240565 1.3873e+01 8*a6228
>
> Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!

Yes, an the "killed" means it wasn't done after 120 seconds. This
factor of 10 is about the worst we see, but of course very surprising.

> Nice and consistent, as you mentioned. And I assume your notation
> here means that it's across 2 nodes.

Yes, the Quadrics nodes are 2-socket dual core, so 8 procs needs two
nodes.

The rest of your observations are consistent with my understanding. We
identified two other issues, neither of which accounts for a factor of
10, but which account for at least a factor of 3.

1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
   ensure that it was wiped before the next task ran. So if you got a
   node after some job that left stinky feces there, you could
   effectively only have 16 GB (before the old stuff would be swapped
   out). More importantly, the physical pages backing the ramdisk may
   not be uniformly distributed across the sockets, and rather than
   preemptively swap out those old ramdisk pages, the kernel would find
   a page on some other socket (instead of locally, this could be
   confirmed, for example, by watching the numa_foreign and numa_miss
   counts with numastat). Then when you went to use that memory
   (typically in a bandwidth-limited application), it was easy to have 3
   sockets all waiting on one bus, thus taking a factor of 3+
   performance hit despite a resident set much less than 50% of the
   available memory. I have a rather complete analysis of this in case
   someone is interested. Note that this can affect programs with
   static or dynamic allocation (the kernel looks for local pages when
   you fault it, not when you allocate it), the only way I know of to
   circumvent the problem is to allocate memory with libnuma
   (e.g. numa_alloc_local) which will fail if local memory isn't
   available (instead of returning and subsequently faulting remote
   pages).

2. The memory bandwidth is 16-18% different between sockets, with
   sockets 0,3 being slow and sockets 1,2 having much faster available
   bandwidth. This is fully reproducible and acknowledged by
   Sun/Oracle, their response to an early inquiry:

     http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf

   I am not completely happy with this explanation because the issue
   persists even with full software prefetch, packed SSE2, and
   non-temporal stores; as long as the working set does not fit within
   (per-socket) L3. Note that the software prefetch allows for several
   hundred cycles of latency, so the extra hop for snooping shouldn't be
   a problem. If the working set fits within L3, then all sockets are
   the same speed (and of course much faster due to improved bandwidth).
   Some disassembly here:

     http://gist.github.com/476942

   The three with prefetch and movntpd run within 2% of each other, the
   other is much faster within cache and much slower when it breaks out
   of cache (obviously). The performance numbers are higher than with
   the reference implementation (quoted in Sun/Oracle's repsonse), but
   (run with taskset to each of the four sockets):

     Triad: 5842.5814 0.0329 0.0329 0.0330
     Triad: 6843.4206 0.0281 0.0281 0.0282
     Triad: 6827.6390 0.0282 0.0281 0.0283
     Triad: 5862.0601 0.0329 0.0328 0.0331

   This is almost exclusively due to the prefetching, the packed
   arithmetic is almost completely inconsequential when waiting on
   memory bandwidth.

Jed