Open MPI Development Mailing List Archives


From: Tim S. Woodall (twoodall_at_[hidden])
Date: 2005-08-16 11:47:52


Hello Gleb,

After thinking about this a bit more, I agree with your approach. We're
going to look at moving the cache/MRU list into each mpool and
maintaining the registrations on a per-btl basis. This should actually
clean up a lot of code.

We're working on an approach to this now that I think will address what
you are looking for. Will let you know when we have something a little
more concrete.

Tim

Gleb Natapov wrote:
> Hello Tim,
>
> On Thu, Aug 11, 2005 at 10:08:04AM -0600, Tim S. Woodall wrote:
>
>>Hello Gleb,
>>
>>A couple of general comments:
>>
>>We initially started by maintaining the cache at the btl/mpool level. However,
>>we needed to expose the registrations to the upper level (pml), to
>>allow the pml to make scheduling decisions (which btl/protocol to use),
>>so we re-organized this to maintain a global cache/tree, where a given
>>registration in the tree may reference multiple btls. This allows the pml
>>to do a single lookup, and optionally schedule the message on the set of
>>btls that have registered the memory.
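>>
>>Schematically, an entry in that global tree might look something like
>>this (a sketch only, not the actual Open MPI struct; the field names
>>are illustrative):
>>
>>struct cache_entry {
>>    void  *base;       /* start of the registered region      */
>>    size_t len;        /* length of the registered region     */
>>    size_t num_btls;   /* how many btls hold a registration   */
>>    struct mca_mpool_base_registration_t *regs[]; /* one per btl */
>>};
>>
>>One lookup on (base, len) then tells the pml every btl that already
>>has the memory registered.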
>>
>
> I understand the need to expose the registration cache to the pml, and I
> think the optimisation you are using it for is a very neat idea, but as far
> as I can see the current code returns multiple btls for a registration only
> if the memory was registered using mca_mpool_base_alloc() (a rare case, IMHO).
>
>
>>>The saddest thing is that you can't override the interface in your own
>>>module. It is too tightly coupled with the pml (ob1) and the btls. If you
>>>don't like the way the registration cache works, the only way to fix it is
>>>to rewrite pml/btl/mpool.
>>>
>>
>>True. We could implement a new framework for the cache, to allow this to be replaced.
>>However, my preference is still to maintain a single cache/tree, to minimize latency/overhead
>>in doing lookups.
>
> I think a new framework for the cache would be great and would fit the
> Open MPI philosophy nicely (if something can be done differently, make it
> modular :)). Regarding lookup latency, I think it is not so obvious that
> latency would suffer (your data structures would be more complex, for
> instance), and think about the other optimisations you could do if the
> cache were per btl, like searching the caches of only the btls in
> endpoint->btl_rdma.
>
>>>I have some ideas about the interface I would like to see, but perhaps it
>>>will not play nicely with the way ob1 works now. And remember, my view is
>>>IB-centric and may be completely wrong for other interconnects. I will be
>>>glad to hear your comments.
>>>
>>>I think the cache should be implemented per mpool rather than as a single
>>>global one.
>>>
>>>Three functions would be added to mca_mpool_base_module_t:
>>>mpool_insert(mca_mpool_base_module_t, mca_mpool_base_registration_t)
>>>mca_mpool_base_registration_t mpool_find(mca_mpool_base_module_t, void *addr, size_t size)
>>>mpool_put(mca_mpool_base_module_t, mca_mpool_base_registration_t)
>>>
>>>Each mpool can override these functions and provide its own cache
>>>implementation, but the base implementation will provide a default one.
>>>The cache will maintain its own MRU list.
>>>
>>>mca_mpool_base_find(void *addr, size_t length) will iterate through the
>>>mpool list, call mpool_find() on each of them, and return a list of
>>>registrations to the pml. The pml should call mpool_put() on any
>>>registration it no longer needs (this is required for proper reference
>>>counting).
>>>
>>>The btl will call mpool_insert() after mpool_register(); it may be
>>>possible to merge these two functions into one.
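>>>
>>>In C, these additions might be declared roughly as follows (a sketch of
>>>the idea only, not real Open MPI code; the typedef names are made up):
>>>
>>>#include <stddef.h>
>>>
>>>struct mca_mpool_base_module_t;
>>>struct mca_mpool_base_registration_t;
>>>
>>>/* insert a registration into this mpool's cache */
>>>typedef void (*mca_mpool_base_module_insert_fn_t)(
>>>    struct mca_mpool_base_module_t *mpool,
>>>    struct mca_mpool_base_registration_t *reg);
>>>
>>>/* look up a cached registration covering [addr, addr + size) */
>>>typedef struct mca_mpool_base_registration_t *
>>>    (*mca_mpool_base_module_find_fn_t)(
>>>        struct mca_mpool_base_module_t *mpool, void *addr, size_t size);
>>>
>>>/* release a reference obtained from mpool_find() */
>>>typedef void (*mca_mpool_base_module_put_fn_t)(
>>>    struct mca_mpool_base_module_t *mpool,
>>>    struct mca_mpool_base_registration_t *reg);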
>>>
>>
>>My only issue with this is the cost of iterating over each of the mpools and doing
>>a lookup in each.
>
> What about the optimisation I mentioned (iterating only over
> endpoint->btl_rdma)? Also, I looked more closely at all the usages of
> mca_mpool_base_find() (there are not many of them) and I have a question:
>
> Let's say an application wants to send buffer B to 100 other ranks, so it
> does 100 MPI_Isend(B) calls. Open MPI decides to use RDMA to send B and
> calls mca_pml_ob1_send_request_start_prepare() for each sendreq.
> The following is executed for each sendreq:
> ...
> sendreq->req_chunk = mca_mpool_base_find(sendreq->req_send.req_addr);
> ...
> Note that req_chunk is cached in the sendreq until the call to
> mca_pml_ob1_send_request_put() (let's say B has not been registered yet,
> so sendreq->req_chunk is NULL for all 100 requests).
> When mca_pml_ob1_send_request_put() is called on the first sendreq some
> time later, it finds NULL in sendreq->req_chunk and the subsequent call to
> mca_bml_base_prepare_src() registers B; but the other 99 requests still
> have NULL in sendreq->req_chunk, so they will register the same buffer 99
> more times!
>
> Is this scenario possible, or did I miss something?
>
> If it is possible, then it looks like the right thing to do is to search
> the cache inside mca_pml_ob1_send_request_put(), and since the btl is
> available as a parameter to this function, we can search only that btl's
> cache.
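>
> Roughly (pseudocode only, assuming the per-btl mpool_find() from my
> proposal above; this is not the actual ob1 code):
>
> /* inside mca_pml_ob1_send_request_put(), before preparing the source */
> if (NULL == sendreq->req_chunk) {
>     sendreq->req_chunk = mpool->mpool_find(mpool,
>                                            sendreq->req_send.req_addr,
>                                            size);
> }
> /* only if the cache still comes up empty do we fall through to
>  * mca_bml_base_prepare_src() and register B, so the buffer ends up
>  * registered once instead of 100 times */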
>
>
>>
>>>I have code that manages overlapping registrations and I am porting it to
>>>Open MPI now, but without changing the way the mpool works it will not be
>>>very useful.
>>>
>>
>>Could we implement this as a single cache where each entry could reference
>>multiple mpools/btls?
>>
>
> Yes, this is possible, but with a performance hit.
>
> My cache slightly resembles the Linux virtual memory area (VMA) list.
>
> The registration structure looks like this:
> struct reg {
>     int begin;   /* start of registration */
>     int end;     /* end of registration   */
> };
>
> The VMA structure looks like this:
> struct vma {
>     int begin;      /* first page in the VMA */
>     int end;        /* last page in the VMA  */
>     list reg_list;  /* registrations covering this VMA, sorted by reg->end */
> };
>
> VMAs never overlap. The reg_list of each VMA contains a sorted list of
> all the registrations that cover the entire VMA.
> For instance, if we have two registrations, the first (R1) from 50 to 150
> and the second (R2) from 100 to 200, the cache will have three VMAs. The
> first will cover the area from 50 to 99 and have R1 in its reg_list, the
> second will cover the area from 100 to 150 and have R2 and R1 in its
> reg_list, and the third will cover the area from 151 to 200 and have R2
> in its reg_list.
>
> All VMAs are stored in an R/B tree and in a sorted list. Insertion time
> is linear; search is logarithmic. The search function takes an address
> and a size as parameters: it looks for the VMA holding the address and
> checks whether the first registration on the list is big enough to hold
> size bytes (note that reg_list is sorted by reg->end, so it is enough to
> check only the first element of the list).
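>
> A compilable sketch of that lookup (simplified: a plain array stands in
> for the R/B tree, and reg_list here is an array sorted by reg->end in
> descending order):
>
> #include <stddef.h>
>
> struct reg { int begin; int end; };
>
> struct vma {
>     int begin;             /* first page in the VMA         */
>     int end;               /* last page in the VMA          */
>     struct reg **reg_list; /* regs covering this whole VMA, */
>     size_t nregs;          /* sorted by reg->end descending */
> };
>
> /* return a registration covering [addr, addr + size - 1], or NULL */
> static struct reg *cache_find(struct vma *vmas, size_t nvmas,
>                               int addr, int size)
> {
>     for (size_t i = 0; i < nvmas; i++) { /* R/B tree makes this O(log n) */
>         if (addr < vmas[i].begin || addr > vmas[i].end)
>             continue;
>         /* every reg on the list covers the whole VMA, so the one with
>          * the largest end (the first) is the only candidate to check */
>         if (vmas[i].nregs > 0 &&
>             vmas[i].reg_list[0]->end >= addr + size - 1)
>             return vmas[i].reg_list[0];
>         return NULL;
>     }
>     return NULL;
> }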
>
> I could hold registrations from different mpools in the same database,
> but in that case the search function would be less efficient, since it
> would need to scan the whole reg_list to find a registration for each
> mpool, and in most cases the caller doesn't even need the whole list,
> only the registration for a specific btl!
>
> --
> Gleb.
>