On 7/18/13 7:39 PM, "Ralph Castain" <rhc@open-mpi.org> wrote:

We are looking at exascale requirements, and one of the big issues is memory footprint. We currently retrieve the endpoint info for every process in the job, plus all the procs in any communicator with which we do a connect/accept - even though we probably will only communicate with a small number of them. This wastes a lot of memory at scale.

As long as we are re-working the endpoint stuff, would it make sense to go ahead and change how we handle the above? I'm looking to switch to a lazy definition approach where we compute endpoints for procs on first message instead of during MPI_Init, retrieving the endpoint info for that proc only at that time. So instead of storing all the endpoint info for every proc in each proc, each proc would contain only the info it requires for that application.

It depends on what you mean by endpoint information.  If you mean what I call endpoint information (the stuff the PML/MTL/BML stores on an ompi_proc_t), then I really don't care.  For Portals, the endpoint information is quite small (8-16 bytes, depending on addressing mode), so I'd rather pre-populate the array than slow down the send path with yet another conditional to check whether the endpoint data is there.  Of course, given the Portals usage model, I'd really like to jam the endpoint data into shared memory at some point (not this patch).  If others want to figure out how to do lazy endpoint data setup for their network, I think that's reasonable.
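
To make the tradeoff concrete, here is a minimal sketch of the two lookup styles.  It is not OMPI code: endpoint_t, fetch_endpoint(), and the eager_*/lazy_* names are all hypothetical, and the 8-byte layout only mimics the size quoted above for Portals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 8-byte endpoint; the fields are illustrative only. */
typedef struct { uint32_t nid; uint32_t pid; } endpoint_t;

/* Stand-in for whatever the runtime does to look up a peer's endpoint. */
static endpoint_t fetch_endpoint(size_t rank) {
    endpoint_t ep = { (uint32_t)(rank / 16), (uint32_t)(rank % 16) };
    return ep;
}

/* Eager: every slot is filled at init; the send path is a plain index. */
typedef struct { endpoint_t *eps; size_t nprocs; } eager_table_t;

static int eager_init(eager_table_t *t, size_t nprocs) {
    t->eps = malloc(nprocs * sizeof(endpoint_t));
    if (t->eps == NULL) return -1;
    t->nprocs = nprocs;
    for (size_t i = 0; i < nprocs; i++) t->eps[i] = fetch_endpoint(i);
    return 0;
}

static endpoint_t eager_get(const eager_table_t *t, size_t rank) {
    assert(rank < t->nprocs);
    return t->eps[rank];                 /* no branch on endpoint state */
}

/* Lazy: slots start empty; every lookup pays a validity check. */
typedef struct {
    endpoint_t *eps;
    unsigned char *valid;                /* 1 once eps[i] has been fetched */
    size_t nprocs;
    size_t fetched;                      /* number of lookups performed so far */
} lazy_table_t;

static int lazy_init(lazy_table_t *t, size_t nprocs) {
    t->eps = malloc(nprocs * sizeof(endpoint_t));
    t->valid = calloc(nprocs, 1);
    if (t->eps == NULL || t->valid == NULL) {
        free(t->eps);
        free(t->valid);
        return -1;
    }
    t->nprocs = nprocs;
    t->fetched = 0;
    return 0;
}

static endpoint_t lazy_get(lazy_table_t *t, size_t rank) {
    assert(rank < t->nprocs);
    if (!t->valid[rank]) {               /* the extra conditional in the send path */
        t->eps[rank] = fetch_endpoint(rank);
        t->valid[rank] = 1;
        t->fetched++;
    }
    return t->eps[rank];
}
```

Note that the lazy variant here still allocates a dense array, so it only defers the lookups; it does not shrink the storage.  Shrinking the storage needs a sparse structure, which is a separate question.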

Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe changing it to a sparse array/list of some type, so we only create that storage for procs we actually communicate with.

This would actually break a whole lot of things in OMPI and is a huge change.  However, I still have plans to add a --enable-minimal-memory type option some day which will make the ompi_proc_t significantly smaller by assuming homogeneous convertors and that you can programmatically get a remote host name when needed.  Again, unless we need to get micro-small (and I don't think we do), the sparseness requires conditionals in the critical path that worry me.
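
For illustration only, a sparse table keyed by rank might look roughly like the sketch below.  Everything in it is an assumption: proc_slot_t is a stand-in and not the real ompi_proc_t, and sparse_init()/sparse_get() are hypothetical names.  It uses fixed-capacity open addressing with linear probing, and the probe loop is exactly the sort of conditional I'm worried about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-peer record; a stand-in, not the real ompi_proc_t. */
typedef struct { uint32_t rank; int used; uint64_t data; } proc_slot_t;

/* Fixed-capacity open-addressing table; cap must be a power of two. */
typedef struct { proc_slot_t *slots; size_t cap; size_t count; } sparse_procs_t;

static int sparse_init(sparse_procs_t *t, size_t cap) {
    assert(cap != 0 && (cap & (cap - 1)) == 0);
    t->slots = calloc(cap, sizeof(proc_slot_t));
    if (t->slots == NULL) return -1;
    t->cap = cap;
    t->count = 0;
    return 0;
}

/* Return the record for rank, creating it on first use.
 * Returns NULL if the table is full and rank is not present. */
static proc_slot_t *sparse_get(sparse_procs_t *t, uint32_t rank) {
    size_t mask = t->cap - 1;
    size_t i = (size_t)(rank * 2654435761u) & mask;   /* multiplicative hash */
    for (size_t probes = 0; probes < t->cap; probes++) {
        proc_slot_t *s = &t->slots[i];
        if (!s->used) {                  /* first contact: claim the slot */
            s->used = 1;
            s->rank = rank;
            s->data = 0;
            t->count++;
            return s;
        }
        if (s->rank == rank) return s;   /* already known peer */
        i = (i + 1) & mask;              /* linear probe to the next slot */
    }
    return NULL;
}
```

Storage then scales with the number of peers actually contacted rather than with job size, at the cost of a hash plus at least one branch on every lookup.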

If you'd prefer to discuss this as a separate issue, that's fine - just something we need to work on at some point in the next year or two.

I agree some work is needed, but I think it's orthogonal to this issue and is something we're going to need to study in detail.  There are a number of space/time tradeoffs in that path.  That isn't a problem, but there's a whole lot of low-hanging fruit to pick before we get to the hard stuff.  Now if you want the OFED interfaces to run at exascale, well, buy lots of memory.


  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories