Open MPI Development Mailing List Archives

From: Tim S. Woodall (twoodall_at_[hidden])
Date: 2005-08-11 11:08:04

Hello Gleb,

A couple of general comments:

We initially started by maintaining the cache at the btl/mpool level. However,
we needed to expose the registrations to the upper level (pml), to
allow the pml to make scheduling decisions (which btl/protocol to use),
so we re-organized this to maintain a global cache/tree, where a given
registration in the tree may reference multiple btls. This allows the pml
to do a single lookup, and optionally schedule the message on the set of
btls that have registered the memory.
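To make the single-lookup idea concrete, here is a minimal sketch. It is not the actual Open MPI code: all names (sketch_reg_t, sketch_cache_add, sketch_cache_lookup) are hypothetical, a flat array stands in for the R/B tree, and the set of btls is assumed to fit in a 32-bit mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_REGS 16

/* Hypothetical types for illustration; not the Open MPI structures. */
typedef struct {
    uintptr_t base;      /* start of the registered region */
    size_t    len;       /* length of the region in bytes */
    uint32_t  btl_mask;  /* bit i set => btl i has registered this region */
} sketch_reg_t;

typedef struct {
    sketch_reg_t regs[SKETCH_MAX_REGS];  /* flat array standing in for the tree */
    size_t       count;
} sketch_cache_t;

void sketch_cache_init(sketch_cache_t *c)
{
    assert(c != NULL);
    c->count = 0;
}

/* Record that btl `btl_id` registered [base, base+len). An existing entry
 * for the same region is shared rather than shadowed. Returns 0 on success,
 * -1 if the cache is full. */
int sketch_cache_add(sketch_cache_t *c, uintptr_t base, size_t len,
                     unsigned btl_id)
{
    assert(c != NULL && btl_id < 32);
    for (size_t i = 0; i < c->count; i++) {
        if (c->regs[i].base == base && c->regs[i].len == len) {
            c->regs[i].btl_mask |= (uint32_t)1 << btl_id;
            return 0;
        }
    }
    if (c->count == SKETCH_MAX_REGS) {
        return -1;
    }
    c->regs[c->count].base = base;
    c->regs[c->count].len = len;
    c->regs[c->count].btl_mask = (uint32_t)1 << btl_id;
    c->count++;
    return 0;
}

/* Single lookup: returns the mask of btls holding a registration that fully
 * covers [addr, addr+len), or 0 if none does. */
uint32_t sketch_cache_lookup(const sketch_cache_t *c, uintptr_t addr,
                             size_t len)
{
    uint32_t mask = 0;
    assert(c != NULL);
    for (size_t i = 0; i < c->count; i++) {
        const sketch_reg_t *r = &c->regs[i];
        if (addr >= r->base && addr - r->base <= r->len &&
            len <= r->len - (addr - r->base)) {
            mask |= r->btl_mask;
        }
    }
    return mask;
}
```

With a mask in hand, the pml could choose among the btls whose bits are set without a second lookup.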

That said, there are problems with the current approach as you've indicated.
MRU lists are maintained on a per-btl module basis (for fairness), which results
in a good bit of duplicated code across btls. Also, as you've indicated, the
current API/global cache (R/B tree) doesn't support overlapping registrations.

Additional comments inline:

Gleb Natapov wrote:
> Hello Tim,
> On Tue, Aug 09, 2005 at 10:22:34AM -0600, Timothy B. Prins wrote:
>>If you have any other ideas of how to do it please let us know.
> I have to confess I don't like current pindown cache implementation much or
> perhaps I don't understand it enough.
> What I managed to understand from the code is this:
> There are three functions:
> int mca_mpool_base_insert(void *addr, size_t size,
>                           mca_mpool_base_module_t *mpool,
>                           void *user_data,
>                           mca_mpool_base_registration_t *registration);
> int mca_mpool_base_remove(void *base);
> mca_mpool_base_chunk_t *mca_mpool_base_find(void *base);
> When a btl registers memory, it inserts the registration in the global cache by
> calling mca_mpool_base_insert(). This insertion may shadow a registration of the
> same memory from another module or even from the same module.
> mca_mpool_base_remove() removes an address from the cache, but there is no way
> a module can guarantee that the deleted registration belongs to the module calling
> remove.
> mca_mpool_base_find() returns the first registration it encounters in the cache. The
> registration may not be the best (biggest) or it may belong to the wrong module
> (the endpoint is not accessible through it).

This is true. We have discussed changing the API to accept the base address
and range - and return the entire set of overlapping registrations.
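A rough sketch of what such a range-based lookup might look like follows. The names (ovl_reg_t, ovl_find_all) are hypothetical, and a linear scan over an array is used purely for illustration; a real cache would presumably use the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration record; not the Open MPI structure. */
typedef struct {
    uintptr_t base;   /* start of the registered region */
    size_t    len;    /* length of the region in bytes */
    int       owner;  /* hypothetical id of the module that registered it */
} ovl_reg_t;

/* Half-open intervals [a, a+alen) and [b, b+blen) overlap iff each one
 * starts before the other ends. Address wrap-around is ignored here. */
int ovl_overlaps(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Collect every registration overlapping [base, base+len). At most out_cap
 * pointers are stored in `out`; the return value is the total number of
 * overlapping registrations, which may exceed out_cap. */
size_t ovl_find_all(const ovl_reg_t *regs, size_t nregs,
                    uintptr_t base, size_t len,
                    const ovl_reg_t **out, size_t out_cap)
{
    size_t found = 0;
    assert(regs != NULL || nregs == 0);
    assert(out != NULL || out_cap == 0);
    for (size_t i = 0; i < nregs; i++) {
        if (ovl_overlaps(regs[i].base, regs[i].len, base, len)) {
            if (found < out_cap) {
                out[found] = &regs[i];
            }
            found++;
        }
    }
    return found;
}
```

Returning the whole set leaves it to the caller to pick the best (e.g. biggest) registration, or one owned by a module that can reach the endpoint.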

> Each btl should maintain its own mru list, but the code is pretty much the same.

Agreed - this is ugly...

> The saddest thing is you can't override the interface in your module. It is too
> coupled with pml (ob1) and btls. If you don't like the way the registration cache
> works, the only way to fix it is to rewrite pml/btl/mpool.

True. We could implement a new framework for the cache, to allow this to be replaced.
However, my preference is still to maintain a single cache/tree, to minimize latency/overhead
in doing lookups.

> I have some ideas about the interface that I want to see, but perhaps it will not
> play nice with the way ob1 works now. And remember my view is IB centric and may
> be completely wrong for other interconnects. I will be glad to hear your
> comments.
> I think the cache should be implemented for each mpool rather than as a single global one.
> Three functions will be added to mca_mpool_base_module_t:
> mpool_insert(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> mca_mpool_base_registration_t mpool_find(mca_mpool_base_module_t, void *addr, size_t size)
> mpool_put (mca_mpool_base_module_t, mca_mpool_base_registration_t);
> Each mpool can override those functions and provide its own cache implementation.
> But the base implementation will provide a default one. The cache will maintain its
> own mru list.
> mca_mpool_base_find(void *addr, size_t length) will iterate through the mpool list,
> call mpool_find() for each of them, and return a list of registrations to the
> pml. The pml should call mpool_put() on any registration it no longer needs (this is
> needed for proper reference counting).
> The btl will call mpool_insert() after mpool_register(); it is possible to merge these
> two functions into one.

My only issue with this is the cost of iterating over each of the mpools and doing
a lookup in each.
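For reference, the per-mpool scheme could be sketched roughly as below, which also shows where the extra cost comes from: one find() call per mpool on every lookup. This is only an illustration under assumed names (pm_mpool_t, pm_default_find, pm_default_put, pm_base_find); it does not reflect the existing mpool interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the Open MPI structures. */
typedef struct {
    uintptr_t base;      /* start of the registered region */
    size_t    len;       /* length of the region in bytes */
    int       refcount;  /* references handed out by find() and not yet put() */
} pm_reg_t;

typedef struct pm_mpool pm_mpool_t;
struct pm_mpool {
    /* Each mpool may override these with its own cache implementation. */
    pm_reg_t *(*find)(pm_mpool_t *mp, uintptr_t addr, size_t len);
    void      (*put)(pm_mpool_t *mp, pm_reg_t *reg);
    pm_reg_t  regs[4];   /* tiny per-mpool cache, linear scan */
    size_t    nregs;
};

/* Default find: first registration fully covering [addr, addr+len).
 * Takes a reference on the registration it returns. */
pm_reg_t *pm_default_find(pm_mpool_t *mp, uintptr_t addr, size_t len)
{
    assert(mp != NULL);
    for (size_t i = 0; i < mp->nregs; i++) {
        pm_reg_t *r = &mp->regs[i];
        if (addr >= r->base && addr - r->base <= r->len &&
            len <= r->len - (addr - r->base)) {
            r->refcount++;
            return r;
        }
    }
    return NULL;
}

/* Default put: drop the reference taken by find(). */
void pm_default_put(pm_mpool_t *mp, pm_reg_t *reg)
{
    (void)mp;
    assert(reg != NULL && reg->refcount > 0);
    reg->refcount--;
}

/* Base-level find: one lookup per mpool. `out` must have room for nmpools
 * entries. Returns the number of registrations found. */
size_t pm_base_find(pm_mpool_t **mpools, size_t nmpools,
                    uintptr_t addr, size_t len, pm_reg_t **out)
{
    size_t found = 0;
    assert(mpools != NULL && out != NULL);
    for (size_t i = 0; i < nmpools; i++) {
        pm_reg_t *r = mpools[i]->find(mpools[i], addr, len);
        if (r != NULL) {
            out[found++] = r;
        }
    }
    return found;
}
```

The loop in pm_base_find() is the overhead in question: its cost grows with the number of mpools, whereas a single shared tree needs one lookup regardless.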

> I have code that manages overlapping registrations and I am porting it to
> openmpi now, but without changing the way mpool works it will not be very
> useful.

Could we implement this as a single cache where each entry could reference multiple

Any thoughts/opinions regarding a framework for a single cache?