Given the oversubscription on the existing HT links, could contention account for the difference? (I have no idea how HT's contention management works) Meaning: if the stars line up in a given run, you could end up with very little/no contention and you get good bandwidth. But if there's a bit of jitter, you could end up with quite a bit of contention that ends up cascading into a bunch of additional delay.
I fail to see how that could add up to 70-80 (or more) seconds of difference -- 13 secs vs. 90+ seconds (and more), though... 70-80 seconds sounds like an IO delay -- perhaps paging due to the ramdisk or somesuch...? That's a SWAG.
On Jul 15, 2010, at 10:40 AM, Jed Brown wrote:
> On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > Per my other disclaimer, I'm trolling through my disastrous inbox and
> > finding some orphaned / never-answered emails. Sorry for the delay!
> No problem, I should have followed up on this with further explanation.
> > Just to be clear -- you're running 8 procs locally on an 8 core node,
> > right?
> These are actually 4-socket quad-core nodes, so there are 16 cores
> available, but we are only running on 8, -npersocket 2 -bind-to-socket.
> This was a greatly simplified case, but is still sufficient to show the
> variability. It tends to be somewhat worse if we use all cores of a
> > (Cisco is an Intel partner -- I don't follow the AMD line
> > much) So this should all be local communication with no external
> > network involved, right?
> Yes, this was the greatly simplified case, contained entirely within a
> > > lsf.o240562 killed 8*a6200
> > > lsf.o240563 9.2110e+01 8*a6200
> > > lsf.o240564 1.5638e+01 8*a6237
> > > lsf.o240565 1.3873e+01 8*a6228
> > Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!
> Yes, and the "killed" means it wasn't done after 120 seconds. This
> factor of 10 is about the worst we see, but of course very surprising.
> > Nice and consistent, as you mentioned. And I assume your notation
> > here means that it's across 2 nodes.
> Yes, the Quadrics nodes are 2-socket dual core, so 8 procs needs two
> The rest of your observations are consistent with my understanding. We
> identified two other issues, neither of which accounts for a factor of
> 10, but which account for at least a factor of 3.
> 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
> ensure that it was wiped before the next task ran. So if you got a
> node after some job that left stinky feces there, you could
> effectively only have 16 GB (before the old stuff would be swapped
> out). More importantly, the physical pages backing the ramdisk may
> not be uniformly distributed across the sockets, and rather than
> preemptively swap out those old ramdisk pages, the kernel would find
> a page on some other socket (instead of locally; this could be
> confirmed, for example, by watching the numa_foreign and numa_miss
> counts with numastat). Then when you went to use that memory
> (typically in a bandwidth-limited application), it was easy to have 3
> sockets all waiting on one bus, thus taking a factor of 3+
> performance hit despite a resident set much less than 50% of the
> available memory. I have a rather complete analysis of this in case
> someone is interested. Note that this can affect programs with
> static or dynamic allocation (the kernel looks for local pages when
> you fault it, not when you allocate it). The only way I know of to
> circumvent the problem is to allocate memory with libnuma
> (e.g. numa_alloc_local) which will fail if local memory isn't
> available (instead of returning and subsequently faulting remote
> 2. The memory bandwidth is 16-18% different between sockets, with
> sockets 0,3 being slow and sockets 1,2 having much faster available
> bandwidth. This is fully reproducible and acknowledged by
> Sun/Oracle, their response to an early inquiry:
> I am not completely happy with this explanation because the issue
> persists even with full software prefetch, packed SSE2, and
> non-temporal stores, as long as the working set does not fit within
> (per-socket) L3. Note that the software prefetch allows for several
> hundred cycles of latency, so the extra hop for snooping shouldn't be
> a problem. If the working set fits within L3, then all sockets are
> the same speed (and of course much faster due to improved bandwidth).
> Some disassembly here:
> The three with prefetch and movntpd run within 2% of each other, the
> other is much faster within cache and much slower when it breaks out
> of cache (obviously). The performance numbers are higher than with
> the reference implementation (quoted in Sun/Oracle's response), but
> (run with taskset to each of the four sockets):
> Triad: 5842.5814 0.0329 0.0329 0.0330
> Triad: 6843.4206 0.0281 0.0281 0.0282
> Triad: 6827.6390 0.0282 0.0281 0.0283
> Triad: 5862.0601 0.0329 0.0328 0.0331
> This is almost exclusively due to the prefetching; the packed
> arithmetic is almost completely inconsequential when waiting on
> memory bandwidth.
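
As a quick sanity check on the "16-18% different between sockets" figure, the Triad rates quoted above can be compared directly. This is only a minimal sketch: it assumes the four rows are listed in socket order 0, 1, 2, 3 (consistent with sockets 0,3 being the slow ones) and that the first column is the STREAM rate in MB/s. The function names are made up for illustration.

```python
# Triad rates copied from the four taskset runs quoted above.
# Assumption: rows are in socket order 0, 1, 2, 3 and the unit is MB/s.
triad_mb_per_s = [5842.5814, 6843.4206, 6827.6390, 5862.0601]

SLOW_SOCKETS = (0, 3)  # reported as slow in the quoted message
FAST_SOCKETS = (1, 2)  # reported as fast in the quoted message


def percent_faster(fast, slow):
    """Return how much higher `fast` is than `slow`, in percent of `slow`."""
    return 100.0 * (fast - slow) / slow


def socket_gaps(rates):
    """Map each (fast_socket, slow_socket) pair to its bandwidth gap in percent."""
    return {
        (f, s): percent_faster(rates[f], rates[s])
        for f in FAST_SOCKETS
        for s in SLOW_SOCKETS
    }


if __name__ == "__main__":
    for (fast, slow), gap in sorted(socket_gaps(triad_mb_per_s).items()):
        print(f"socket {fast} vs socket {slow}: {gap:.2f}% faster")
```

With these numbers, all four fast/slow pairings land between 16% and 18%, which matches the range stated in the quoted message.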