On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Per my other disclaimer, I'm trolling through my disastrous inbox and
> finding some orphaned / never-answered emails. Sorry for the delay!
No problem, I should have followed up on this with further explanation.
> Just to be clear -- you're running 8 procs locally on an 8 core node,
These are actually 4-socket quad-core nodes, so there are 16 cores
available, but we are only running on 8, -npersocket 2 -bind-to-socket.
This was a greatly simplified case, but is still sufficient to show the
variability. It tends to be somewhat worse if we use all cores of a
node.
> (Cisco is an Intel partner -- I don't follow the AMD line
> much) So this should all be local communication with no external
> network involved, right?
Yes, this was the greatly simplified case, contained entirely within a
single node.
> > lsf.o240562 killed 8*a6200
> > lsf.o240563 9.2110e+01 8*a6200
> > lsf.o240564 1.5638e+01 8*a6237
> > lsf.o240565 1.3873e+01 8*a6228
> Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!
Yes, and the "killed" means it wasn't done after 120 seconds. This
factor of 10 is about the worst we see, but of course very surprising.
> Nice and consistent, as you mentioned. And I assume your notation
> here means that it's across 2 nodes.
Yes, the Quadrics nodes are 2-socket dual-core, so 8 procs need two
nodes.
The rest of your observations are consistent with my understanding. We
identified two other issues, neither of which accounts for a factor of
10, but which account for at least a factor of 3.
1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
ensure that it was wiped before the next task ran. So if you got a
node after some job that left stinky feces there, you could
effectively only have 16 GB (before the old stuff would be swapped
out). More importantly, the physical pages backing the ramdisk may
not be uniformly distributed across the sockets, and rather than
preemptively swap out those old ramdisk pages, the kernel would find
a page on some other socket instead of locally (this could be
confirmed, for example, by watching the numa_foreign and numa_miss
counts with numastat). Then when you went to use that memory
(typically in a bandwidth-limited application), it was easy to have 3
sockets all waiting on one bus, thus taking a factor of 3+
performance hit despite a resident set much less than 50% of the
available memory. I have a rather complete analysis of this in case
someone is interested. Note that this can affect programs with
static or dynamic allocation (the kernel looks for local pages when
you fault a page, not when you allocate it); the only way I know of
to circumvent the problem is to allocate memory with libnuma
(e.g. numa_alloc_local), which will fail if local memory isn't
available (instead of returning and subsequently faulting remote
pages).
2. The memory bandwidth is 16-18% different between sockets, with
sockets 0,3 being slow and sockets 1,2 having much faster available
bandwidth. This is fully reproducible and acknowledged by
Sun/Oracle, their response to an early inquiry:
I am not completely happy with this explanation because the issue
persists even with full software prefetch, packed SSE2, and
non-temporal stores, as long as the working set does not fit within
(per-socket) L3. Note that the software prefetch allows for several
hundred cycles of latency, so the extra hop for snooping shouldn't be
a problem. If the working set fits within L3, then all sockets are
the same speed (and of course much faster due to improved bandwidth).
Some disassembly here:
The three with prefetch and movntpd run within 2% of each other; the
other is much faster within cache and much slower when it breaks out
of cache (obviously). The performance numbers are higher than with
the reference implementation (quoted in Sun/Oracle's response), but
the asymmetry between sockets is the same (run with taskset to each
of the four sockets):
Triad: 5842.5814 0.0329 0.0329 0.0330
Triad: 6843.4206 0.0281 0.0281 0.0282
Triad: 6827.6390 0.0282 0.0281 0.0283
Triad: 5862.0601 0.0329 0.0328 0.0331
This is almost exclusively due to the prefetching; the packed
arithmetic is almost completely inconsequential when waiting on
memory.