On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
the right socket (and latency increases from 0.8 to 1.4us). Of course
that's pingpong only, things will be worse on a memory-overloaded
machine. But I don't expect things to be "less worse" if you do an
intermediate copy through the memory near the HCA: you would overload
the QPI link as much as here, and you would overload the CPU even more
because of the additional copies.
Le 08/07/2013 18:27, Michael Thomadakis a écrit :
> People have mentioned that they experience unexpected slow downs in
> PCIe_gen3 I/O when the pages map to a socket different from the one
> the HCA connects to. It is speculated that the inter-socket QPI is not
> provisioned to transfer more than 1GiB/sec for PCIe_gen 3 traffic.
> This situation may not be in effect on all SandyBrige or IvyBridge
> Have you measured anything like this on you systems as well? That
> would require using physical memory mapped to the socket w/o HCA
> exclusively for MPI messaging.
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)
> <jsquyres_at_[hidden] <mailto:jsquyres_at_[hidden]>> wrote:
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis
> <drmichaelt7777_at_[hidden] <mailto:drmichaelt7777_at_[hidden]>> wrote:
> > The issue is that when you read or write PCIe_gen 3 dat to a
> non-local NUMA memory, SandyBridge will use the inter-socket QPIs
> to get this data across to the other socket. I think there is
> considerable limitation in PCIe I/O traffic data going over the
> inter-socket QPI. One way to get around this is for reads to
> buffer all data into memory space local to the same socket and
> then transfer them by code across to the other socket's physical
> memory. For writes the same approach can be used with intermediary
> process copying data.
> Sure, you'll cause congestion across the QPI network when you do
> non-local PCI reads/writes. That's a given.
> But I'm not aware of a hardware limitation on PCI-requested
> traffic across QPI (I could be wrong, of course -- I'm a software
> guy, not a hardware guy). A simple test would be to bind an MPI
> process to a far NUMA node and run a simple MPI bandwidth test and
> see if to get better/same/worse bandwidth compared to binding an
> MPI process on a near NUMA socket.
> But in terms of doing intermediate (pipelined) reads/writes to
> local NUMA memory before reading/writing to PCI, no, Open MPI does
> not do this. Unless there is a PCI-QPI bandwidth constraint that
> we're unaware of, I'm not sure why you would do this -- it would
> likely add considerable complexity to the code and it would
> definitely lead to higher overall MPI latency.
> Don't forget that the MPI paradigm is for the application to
> provide the send/receive buffer. Meaning: MPI doesn't (always)
> control where the buffer is located (particularly for large messages).
> > I was wondering if OpenMPI does anything special memory mapping
> to work around this.
> Just what I mentioned in the prior email.
> > And if with Ivy Bridge (or Haswell) he situation has improved.
> Open MPI doesn't treat these chips any different.
> Jeff Squyres
> jsquyres_at_[hidden] <mailto:jsquyres_at_[hidden]>
> For corporate legal information go to:
> users mailing list
> users_at_[hidden] <mailto:users_at_[hidden]>
> users mailing list