Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Question on handling of memory for communications
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2013-07-08 12:45:00

On a dual E5-2650 machine with FDR cards, I see the IMB PingPong
throughput drop from 6000 to 5700 MB/s when the memory isn't allocated on
the right socket (and latency increases from 0.8 to 1.4 us). Of course
that's pingpong only, things will be worse on a memory-overloaded
machine. But I don't expect things to be "less worse" if you do an
intermediate copy through the memory near the HCA: you would overload
the QPI link as much as here, and you would overload the CPU even more
because of the additional copies.
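(For anyone wanting to reproduce this kind of near-vs-far comparison, a run could be sketched roughly as below with numactl and the Intel MPI Benchmarks. The node numbers, host names, and benchmark path are assumptions, not taken from this thread; check which NUMA node the HCA attaches to with hwloc's lstopo first.)

```shell
# Sketch: compare IMB PingPong with memory and CPU bound near vs. far
# from the HCA. Node 0 is assumed to be the socket the HCA attaches to;
# verify with lstopo, which shows PCI devices under their NUMA node.
lstopo

# Memory and execution bound to the socket near the HCA
# (hosta/hostb are placeholder host names):
mpirun -np 1 -H hosta numactl --cpunodebind=0 --membind=0 ./IMB-MPI1 PingPong : \
       -np 1 -H hostb numactl --cpunodebind=0 --membind=0 ./IMB-MPI1 PingPong

# Same run bound to the far socket, forcing PCIe traffic across QPI:
mpirun -np 1 -H hosta numactl --cpunodebind=1 --membind=1 ./IMB-MPI1 PingPong : \
       -np 1 -H hostb numactl --cpunodebind=1 --membind=1 ./IMB-MPI1 PingPong
```

Comparing the reported bandwidth and latency between the two runs shows the cost of the remote-socket placement directly, without relying on the MPI library's own binding defaults.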


On 08/07/2013 18:27, Michael Thomadakis wrote:
> People have mentioned that they experience unexpected slow downs in
> PCIe_gen3 I/O when the pages map to a socket different from the one
> the HCA connects to. It is speculated that the inter-socket QPI is not
> provisioned to transfer more than 1GiB/sec for PCIe_gen 3 traffic.
> This situation may not be in effect on all SandyBridge or IvyBridge
> systems.
> Have you measured anything like this on your systems as well? That
> would require using physical memory mapped to the socket w/o HCA
> exclusively for MPI messaging.
> Mike
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)
> <jsquyres_at_[hidden] <mailto:jsquyres_at_[hidden]>> wrote:
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis
> <drmichaelt7777_at_[hidden] <mailto:drmichaelt7777_at_[hidden]>> wrote:
> > The issue is that when you read or write PCIe_gen 3 data to a
> non-local NUMA memory, SandyBridge will use the inter-socket QPIs
> to get this data across to the other socket. I think there is
> considerable limitation in PCIe I/O traffic data going over the
> inter-socket QPI. One way to get around this is for reads to
> buffer all data into memory local to the same socket and
> then copy it in software across to the other socket's physical
> memory. For writes the same approach can be used, with an
> intermediate copy of the data.
> Sure, you'll cause congestion across the QPI network when you do
> non-local PCI reads/writes. That's a given.
> But I'm not aware of a hardware limitation on PCI-requested
> traffic across QPI (I could be wrong, of course -- I'm a software
> guy, not a hardware guy). A simple test would be to bind an MPI
> process to a far NUMA node and run a simple MPI bandwidth test and
> see if you get better/same/worse bandwidth compared to binding an
> MPI process on a near NUMA socket.
> But in terms of doing intermediate (pipelined) reads/writes to
> local NUMA memory before reading/writing to PCI, no, Open MPI does
> not do this. Unless there is a PCI-QPI bandwidth constraint that
> we're unaware of, I'm not sure why you would do this -- it would
> likely add considerable complexity to the code and it would
> definitely lead to higher overall MPI latency.
> Don't forget that the MPI paradigm is for the application to
> provide the send/receive buffer. Meaning: MPI doesn't (always)
> control where the buffer is located (particularly for large messages).
> > I was wondering if Open MPI does any special memory mapping
> > to work around this.
> Just what I mentioned in the prior email.
> > And if with Ivy Bridge (or Haswell) the situation has improved.
> Open MPI doesn't treat these chips any differently.
> --
> Jeff Squyres
> jsquyres_at_[hidden] <mailto:jsquyres_at_[hidden]>
> For corporate legal information go to:
> _______________________________________________
> users mailing list
> users_at_[hidden] <mailto:users_at_[hidden]>