Thanks for the reply.
The issue is that when you read or write PCIe gen 3 data to non-local NUMA
memory, SandyBridge will use the inter-socket QPIs to get this data across
to the other socket. I think PCIe I/O traffic is considerably limited when
it has to cross the inter-socket QPI. One way around this is, for reads, to
buffer all data in memory local to the same socket as the PCIe device and
then copy it in software to the other socket's physical memory. For writes,
the same approach can be used, with an intermediary process copying the data.
I was wondering if Open MPI does any special memory mapping to work
around this, and whether the situation has improved with Ivy Bridge (or
Haswell).
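
To make the locality question concrete, here is a minimal sketch (Python,
Linux only) of how one could check which NUMA node a PCIe device is attached
to and which CPUs are local to it. It assumes the standard Linux sysfs files
numa_node and cpulist; the function names and the PCI address in the usage
note are hypothetical, and this is not a description of what Open MPI does
internally.

```python
import os


def device_numa_node(pci_addr, sysfs_root="/sys"):
    """Return the NUMA node sysfs reports for a PCI device, or None if unknown."""
    path = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "numa_node")
    with open(path) as f:
        node = int(f.read().strip())
    # The kernel reports -1 when it has no locality information for the device.
    return node if node >= 0 else None


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-7,16-23' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def node_cpus(node, sysfs_root="/sys"):
    """Return the CPU ids that belong to the given NUMA node."""
    path = os.path.join(sysfs_root, "devices", "system", "node",
                        "node%d" % node, "cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_to_device_node(pci_addr, sysfs_root="/sys"):
    """Pin the calling process to the CPUs local to the device (illustrative helper)."""
    node = device_numa_node(pci_addr, sysfs_root)
    if node is None:
        return None  # no locality information, leave the affinity untouched
    cpus = node_cpus(node, sysfs_root)
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return cpus
```

For example, pin_to_device_node("0000:03:00.0") (a made-up address) would
restrict the process to the cores next to that device. Under Linux's default
local allocation policy, buffers the process then touches for the first time
should normally come from that socket's memory, so the PCIe traffic would not
need to cross QPI.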
On Mon, Jul 8, 2013 at 9:57 AM, Jeff Squyres (jsquyres)
> On Jul 6, 2013, at 4:59 PM, Michael Thomadakis <drmichaelt7777_at_[hidden]>
> > When your stack runs on SandyBridge nodes attached to HCAs over PCIe gen 3,
> do you pay any special attention to the memory buffers according to which
> socket/memory controller their physical memory belongs to?
> > For instance, if the HCA is attached to the PCIe gen 3 lanes of Socket 1, do
> you do anything special when the read/write buffers map to physical memory
> belonging to Socket 2? Or do you avoid using buffers that map to memory
> that belongs to (is accessible via) the other socket?
> It is not *necessary* to ensure that buffers are NUMA-local to the PCI
> device that they are writing to, but it certainly results in lower latency
> to read/write to PCI devices (regardless of flavor) that are attached to an
> MPI process' local NUMA node. The Hardware Locality (hwloc) tool "lstopo"
> can print a pretty picture of your server to show you where your PCI busses
> are connected.
> For TCP, Open MPI will use all TCP devices that it finds by default
> (because it is assumed that latency is so high that NUMA locality doesn't
> matter). The openib (OpenFabrics) transport will use the "closest" HCA
> ports that it can find to each MPI process.
> In our upcoming Cisco ultra low latency BTL, it defaults to using the
> closest Cisco VIC ports that it can find for short messages (i.e., to
> minimize latency), but uses all available VICs for long messages (i.e., to
> maximize bandwidth).
> > Has this situation improved with Ivy Bridge systems or Haswell?
> It's the same overall architecture (i.e., NUMA).
> Jeff Squyres