The driver doesn't allocate much memory here. Maybe some small control buffers, but nothing that matters much for large-message transfer performance. Everything critical here is allocated by user space (either the MPI library or the application), so we just have to make sure we bind the process memory properly. I used hwloc-bind to do that.

Note that we have seen larger issues on older platforms. You basically just need a big HCA and PCI link on a not-so-big machine. Fortunately that is not very common with today's QPI links between Sandy Bridge sockets, which are quite big compared to the PCI Gen3 8x links to the HCA. On old AMD platforms (and modern Intels with big GPUs), issues are not that uncommon (we've seen up to 40% DMA bandwidth difference there).
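
To give a rough idea of "quite big", here is a back-of-the-envelope sketch of the raw link rates. The figures are assumptions, not measurements from this thread: PCIe Gen3 at 8 GT/s per lane with 128b/130b encoding, and one QPI link at 8.0 GT/s carrying 2 bytes of data per transfer (the fastest Sandy Bridge-EP setting). The function names are made up for illustration, and protocol overhead is ignored on both sides.

```python
# Rough per-direction peak bandwidths, ignoring protocol overhead.
# Assumed figures (not taken from the thread):
#   PCIe Gen3: 8 GT/s per lane, 128b/130b encoding
#   QPI:       8.0 GT/s per link, 2 bytes of data per transfer

def pcie_gen3_gbytes_per_s(lanes):
    """Raw PCIe Gen3 bandwidth in GB/s for the given lane count."""
    gigatransfers_per_s = 8.0
    encoding_efficiency = 128.0 / 130.0
    # One bit per transfer per lane, 8 bits per byte.
    return gigatransfers_per_s * encoding_efficiency * lanes / 8.0

def qpi_gbytes_per_s(gigatransfers_per_s=8.0, links=1):
    """Raw QPI data bandwidth in GB/s across the given number of links."""
    return gigatransfers_per_s * 2.0 * links

pcie_x8 = pcie_gen3_gbytes_per_s(lanes=8)
qpi_one_link = qpi_gbytes_per_s()

print(f"PCIe Gen3 x8: {pcie_x8:.2f} GB/s")
print(f"One QPI link: {qpi_one_link:.2f} GB/s")
print(f"QPI / PCIe:   {qpi_one_link / pcie_x8:.2f}x")
```

Under these assumptions a single QPI link has roughly twice the raw bandwidth of a Gen3 8x link, which is consistent with the small penalty measured on this machine. It says nothing about how QPI bandwidth is shared with other traffic.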

Brice



On 08/07/2013 19:44, Michael Thomadakis wrote:
Hi Brice, 

thanks for testing this out.

How did you make sure that the pinned pages used by the I/O adapter were mapped to the "other" socket's memory controller? Is pinning the MPI binary to a socket sufficient to pin the space used for MPI I/O to that socket as well? I think this is something done by and at the HCA device driver level.

Anyway, as long as the memory performance difference is at the levels you mentioned, there is no "big" issue. Most likely the device driver gets space from the same NUMA domain as the socket the HCA is attached to.

Thanks for trying it out
Michael






On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin <Brice.Goglin@inria.fr> wrote:
On a dual E5-2650 machine with FDR cards, I see the IMB PingPong throughput drop from 6000 to 5700 MB/s when the memory isn't allocated on the right socket (and latency increases from 0.8 to 1.4 us). Of course that's PingPong only; things will be worse on a memory-overloaded machine. But I don't expect things to be "less worse" if you do an intermediate copy through the memory near the HCA: you would overload the QPI link as much as here, and you would overload the CPU even more because of the additional copies.
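
For reference, the relative penalties implied by those four numbers can be worked out as below. This is only arithmetic on the figures quoted above; the variable and function names are made up for illustration.

```python
# IMB PingPong figures quoted above (dual E5-2650, FDR cards).
local_bw_mb_s, remote_bw_mb_s = 6000.0, 5700.0   # throughput, MB/s
local_lat_us, remote_lat_us = 0.8, 1.4           # latency, microseconds

def relative_change_percent(reference, measured):
    """Signed change of `measured` relative to `reference`, in percent."""
    return (measured - reference) / reference * 100.0

bw_change = relative_change_percent(local_bw_mb_s, remote_bw_mb_s)
lat_change = relative_change_percent(local_lat_us, remote_lat_us)

print(f"Throughput change with remote memory: {bw_change:+.1f}%")
print(f"Latency change with remote memory:    {lat_change:+.1f}%")
```

That is a throughput loss of about 5% and a latency increase of about 75%, so on this machine the small-message latency is hit much harder than the large-message bandwidth.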

Brice



On 08/07/2013 18:27, Michael Thomadakis wrote:
People have mentioned that they experience unexpected slowdowns in PCIe Gen3 I/O when the pages map to a socket different from the one the HCA connects to. It is speculated that the inter-socket QPI is not provisioned to transfer more than 1 GiB/s of PCIe Gen3 traffic. This situation may not apply to all Sandy Bridge or Ivy Bridge systems.

Have you measured anything like this on your systems as well? That would require exclusively using physical memory mapped to the socket without the HCA for MPI messaging.

Mike


On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <jsquyres@cisco.com> wrote:
On Jul 8, 2013, at 11:35 AM, Michael Thomadakis <drmichaelt7777@gmail.com> wrote:

> The issue is that when you read or write PCIe Gen3 data to non-local NUMA memory, Sandy Bridge will use the inter-socket QPI links to get this data across to the other socket. I think there is a considerable limitation on PCIe I/O traffic going over the inter-socket QPI. One way to get around this is, for reads, to buffer all data in memory local to the same socket as the HCA and then copy it in software to the other socket's physical memory. For writes, the same approach can be used, with an intermediary process copying the data.

Sure, you'll cause congestion across the QPI network when you do non-local PCI reads/writes.  That's a given.

But I'm not aware of a hardware limitation on PCI-requested traffic across QPI (I could be wrong, of course -- I'm a software guy, not a hardware guy).  A simple test would be to bind an MPI process to a far NUMA node, run a simple MPI bandwidth test, and see if you get better/same/worse bandwidth compared to binding the MPI process to a near NUMA node.
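
A minimal sketch of how the two runs of such a test could be set up, assuming hwloc-bind (the tool mentioned elsewhere in this thread) does the binding. Everything specific here is an assumption for illustration: that the HCA sits next to NUMA node 0, the `node:N` location strings (which may need adjusting for your hwloc version), the `./IMB-MPI1 PingPong` benchmark command, and the helper name. The script only builds and prints the command lines; each would be placed after your usual MPI launcher invocation.

```python
# Builds hwloc-bind command lines for a near/far memory comparison.
# Assumptions: the HCA is attached to NUMA node 0, and the benchmark
# command below is a placeholder for whatever bandwidth test you run.

def bound_command(cpu_location, mem_location, program_args):
    """Wrap a program so hwloc-bind sets its CPU and memory binding."""
    return [
        "hwloc-bind",
        "--cpubind", cpu_location,   # where the process runs
        "--membind", mem_location,   # where its memory is allocated
        "--",
    ] + list(program_args)

benchmark = ["./IMB-MPI1", "PingPong"]   # placeholder benchmark command

# Same cores in both runs, so only the memory placement changes.
near = bound_command("node:0", "node:0", benchmark)
far = bound_command("node:0", "node:1", benchmark)

print("near:", " ".join(near))
print("far: ", " ".join(far))
```

Keeping the CPU binding fixed and moving only the memory isolates the cost of DMA crossing QPI. Moving the process to the far node as well would also change how far the cores are from the HCA.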

But in terms of doing intermediate (pipelined) reads/writes to local NUMA memory before reading/writing to PCI, no, Open MPI does not do this.  Unless there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure why you would do this -- it would likely add considerable complexity to the code and it would definitely lead to higher overall MPI latency.

Don't forget that the MPI paradigm is for the application to provide the send/receive buffer.  Meaning: MPI doesn't (always) control where the buffer is located (particularly for large messages).

> I was wondering if Open MPI does any special memory mapping to work around this.

Just what I mentioned in the prior email.

> And if the situation has improved with Ivy Bridge (or Haswell).

Open MPI doesn't treat these chips any differently.

--
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


