On Jul 6, 2013, at 4:59 PM, Michael Thomadakis <firstname.lastname@example.org> wrote:
> When you stack runs on SandyBridge nodes attached to HCAs over PCIe gen 3, do you pay any special attention to the memory buffers according to which socket/memory controller their physical memory belongs to?
> For instance, if the HCA is attached to the PCIe gen 3 lanes of Socket 1, do you do anything special when the read/write buffers map to physical memory belonging to Socket 2? Or do you avoid using buffers mapping to memory that belongs to (is accessible via) the other socket?
It is not *necessary* to ensure that buffers are NUMA-local to the PCI device that they are writing to, but it certainly results in lower latency to read/write to PCI devices (regardless of flavor) that are attached to an MPI process' local NUMA node. The Hardware Locality (hwloc) tool "lstopo" can print a pretty picture of your server to show you where your PCI busses are connected.
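If you want to check the NUMA locality of a PCI device from a script rather than from the lstopo picture, here is a minimal sketch. It assumes a Linux node where the kernel exposes the device's NUMA node in sysfs (e.g., /sys/class/infiniband/<device>/device/numa_node); the helper name device_numa_node and the device name "mlx4_0" mentioned below are only illustrations, not anything that Open MPI or hwloc provides.

```python
from pathlib import Path


def device_numa_node(device, sysfs_class="/sys/class/infiniband"):
    """Return the NUMA node that Linux reports for a device, or None if unknown.

    Assumption: the device appears under `sysfs_class` and its PCI parent
    has a `numa_node` attribute, as on typical Linux systems.
    """
    path = Path(sysfs_class) / device / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        # Device not present, attribute missing, or unexpected contents
        return None
    # The kernel reports -1 when it has no locality information
    return node if node >= 0 else None


if __name__ == "__main__":
    root = Path("/sys/class/infiniband")
    if root.is_dir():
        # Print every HCA the kernel knows about with its NUMA node
        for entry in sorted(root.iterdir()):
            print(entry.name, device_numa_node(entry.name))
```

For example, device_numa_node("mlx4_0") would return the socket-level NUMA node of that HCA if it exists on your node. For Ethernet interfaces, passing sysfs_class="/sys/class/net" and the interface name should work the same way.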
For TCP, Open MPI will use all TCP devices that it finds by default (because it is assumed that latency is so high that NUMA locality doesn't matter). The openib (OpenFabrics) transport will use the "closest" HCA ports that it can find to each MPI process.
Our upcoming Cisco ultra low latency BTL defaults to using the closest Cisco VIC ports that it can find for short messages (i.e., to minimize latency), but uses all available VICs for long messages (i.e., to maximize bandwidth).
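To make the short-vs-long message policy concrete, here is a rough sketch of that kind of selection. It is not the BTL's actual code: the function name select_ports, the 4096-byte cutoff, the fallback when no port is local, and treating "closest" as "same NUMA node" are all assumptions made for illustration.

```python
def select_ports(ports, process_node, message_bytes, short_limit=4096):
    """Pick the ports to use for one message.

    ports: list of (name, numa_node) pairs.
    short_limit: hypothetical cutoff between "short" and "long" messages.
    """
    all_names = [name for name, _ in ports]
    if message_bytes > short_limit:
        # Long message: stripe across every port to maximize bandwidth
        return all_names
    # Short message: prefer ports on the process' own NUMA node
    local = [name for name, node in ports if node == process_node]
    # Assumed fallback: if nothing is local, use whatever is available
    return local or all_names
```

With ports [("vic0", 0), ("vic1", 1)] and a process bound to NUMA node 0, a 64-byte message would go over vic0 only, while a 1 MB message would use both ports.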
> Has this situation improved with Ivy Bridge systems or Haswell?
It's the same overall architecture (i.e., NUMA).