Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] allocating sm memory with page alignment
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-09-08 08:41:36


Actually, I would be interested in this discussion.

On 9/5/08, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> For the mailing list...
>
> Note that we moved this conversation to a higher-bandwidth medium
> (telephone). If others are interested, please let us know.
>
>
> On Sep 3, 2008, at 1:21 AM, Eugene Loh wrote:
>
>> Jeff Squyres wrote:
>>
>>> I think even first-touch will make *the whole page* be local to the
>>> process that touches it.
>>>
>>
>> Right.
>>
>>> So if you have each process take N bytes (where N << page_size), then
>>> the 0th process will make that whole page be local; it may be remote for
>>> others.
>>>
>>
>> I think I'm not making myself clear. Read on...
>>
>>>> *) You wouldn't need to control memory allocations with a lock (except
>>>> for multithreaded apps). I haven't looked at this too closely yet, but the
>>>> 3*n*n memory allocations in shared memory during MPI_Init are currently
>>>> serialized, which sounds disturbing when n is 100 to 500 local processes.
>>>>
>>>
>>> If I'm understanding your proposal right, you're saying that each
>>> process would create its own shared memory space, right? Then any other
>>> process that wants to send to that process would mmap/shmattach/whatever to
>>> the receiver's shared memory space. Right?
>>>
>>
>> I don't think it's necessary to have each process have its own segment.
>> The OS manages the shared area on a per-page basis anyhow. All that's
>> necessary is that there is an agreement up front about which pages will be
>> local to which process. E.g., if there are P processes/processors and the
>> shared area has M pages per process, then there will be P*M pages
>> altogether. We'll say that the first M pages are local to process 0, the
>> next M to process 1, etc. That is, process 0 will first-touch the first M
>> pages, process 1 will first-touch the next M pages, etc. If an allocation
>> needs to be local to process i, then process i will allocate it from its
>> pages. Since only process i can allocate from these pages, it does not need
>> any lock protection to keep other processes from allocating at the same
>> time. And, since these pages have the proper locality, then small
>> allocations can all share common pages (instead of having a separate page
>> for each 12-byte or 64-byte allocation).
>>
>> Clearer? One shared memory region, partitioned equally among all
>> processes. Each process first-touches its own pages to get the right
>> locality. Each allocation is made by the process to which it should be local.
>> Benefits include no multi-process locks and no need for page alignment of
>> tiny allocations.
>>
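
[Editorial sketch of the scheme described above, not code from Open MPI.
It assumes a single page-aligned shared region of P*M pages that is already
mapped by every process; the names partition_t, partition_init, and
partition_alloc are hypothetical. Each process carves small allocations out
of its own M pages with a plain bump pointer, so no cross-process lock is
involved, and requests are aligned only to a cache line rather than a page.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process view of its slice of the shared region. */
typedef struct {
    unsigned char *base; /* start of this process's partition */
    size_t size;         /* partition size in bytes (M * page_size) */
    size_t used;         /* bytes handed out so far */
} partition_t;

/* Round n up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Locate partition `rank` in a region laid out as P consecutive
 * partitions of pages_per_proc pages each. */
static partition_t partition_init(void *region, size_t page_size,
                                  size_t pages_per_proc, size_t rank)
{
    partition_t p;
    p.size = pages_per_proc * page_size;
    p.base = (unsigned char *)region + rank * p.size;
    p.used = 0;
    return p;
}

/* Bump allocation. No lock: only the owning process calls this, and the
 * bookkeeping (p->used) lives in that process's private memory.
 * Returns NULL when the partition is exhausted. */
static void *partition_alloc(partition_t *p, size_t nbytes, size_t align)
{
    size_t start = align_up(p->used, align);
    if (start > p->size || nbytes > p->size - start)
        return NULL;
    p->used = start + nbytes;
    return p->base + start;
}
```

[Alignment here is relative to the partition base, which is page-aligned
whenever the region itself is. A 12-byte and a 64-byte request made with
64-byte alignment end up 64 bytes apart on the same page.]
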
>>> The total amount of shared memory will likely not go down, because the
>>> OS will still likely allocate on a per-page basis, right?
>>>
>>
>> Total amount would go down significantly. Today, if you want to allocate
>> 64 bytes on a page boundary, you allocate 64+pagesize, a 100x overhead.
>> What I'm (evidently not so clearly) proposing is that we establish a
>> policy about what memory will be local to whom. With that policy, we simply
>> allocate our 64 bytes in the appropriate region. This eliminates the need
>> for page alignment (page is already in the right place, shared by many
>> allocations all of whom want to be there). You could still want cacheline
>> alignment... that's fine.
>>
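
[Editorial sketch of the arithmetic behind the overhead figure above. The
function names are hypothetical, and the 4 KiB / 8 KiB page sizes and the
64-byte cache line are assumptions, not values stated in the thread. Under
the "64+pagesize" accounting, a 64-byte request costs 4160 bytes (65x) with
4 KiB pages and 8256 bytes (129x) with 8 KiB pages, which is the same order
of magnitude as the quoted 100x. Packed at cache-line granularity, the same
request costs 64 bytes.]

```c
#include <stddef.h>

/* Bytes consumed when a request is over-allocated by one page so that it
 * can be placed on a page boundary ("64+pagesize" in the message above). */
static size_t page_aligned_cost(size_t nbytes, size_t page_size)
{
    return nbytes + page_size;
}

/* Bytes consumed when requests are packed at cache-line granularity:
 * the request size rounded up to a whole number of lines. */
static size_t cacheline_cost(size_t nbytes, size_t line_size)
{
    return (nbytes + line_size - 1) / line_size * line_size;
}
```
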
>>> But per your 2nd point, would the resources required for each process to
>>> mmap/shmattach/whatever 511 other processes' shared memory spaces be
>>> prohibitive?
>>>
>>
>> No need to have more shared memory segments. Just need a policy to say
>> how your global space is partitioned.
>>
>>>> Graham, Richard L. wrote:
>>>>
>>>>> I have not looked at the code in a long time, so not sure how many
>>>>> things have changed ... In general what you are suggesting is reasonable.
>>>>> However, especially on large machines you also need to worry about memory
>>>>> locality, so should allocate from memory pools that are appropriately
>>>>> located. I expect that memory allocated on a per-socket basis would do.
>>>>>
>>>>
>>>> Is this what "maffinity" and "memory nodes" are about? If so, I would
>>>> think memory locality should be handled there rather than in page alignment
>>>> of individual 12-byte and 64-byte allocations.
>>>>
>>>
>>> maffinity was a first stab at memory affinity and is currently (and has
>>> been for a long, long time) no frills and didn't have a lot of thought put
>>> into it.
>>>
>>> I see the "node id" and "bind" functions in there; I think Gleb must
>>> have added them somewhere along the way. I'm not sure how much thought
>>> was put into making those be truly generic functions (I see them
>>> implemented in libnuma, which AFAIK is Linux-specific). Does Solaris have
>>> memory affinity function calls?
>>>
>>
>> Yes, I believe so, though perhaps I don't understand your question.
>>
>> Things like mbind() and numa_setlocal_memory() are, I assume, Linux calls
>> for placing some memory close to a process. I think the Solaris madvise()
>> call does this: give a memory range and say something about how that memory
>> should be placed -- e.g., the memory should be placed local to the next
>> thread to touch that memory. Anyhow, I think the default policy is "first
>> touch", so one could always do that.
>>
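
[Editorial sketch of relying on the default first-touch policy mentioned
above, using no platform-specific NUMA calls. The function name first_touch
is hypothetical. It assumes the range has not been written by anyone yet
(it stores zeros, so it must run before the memory holds data) and that the
calling process is already running on, or bound to, the CPU whose memory
should back these pages.]

```c
#include <stddef.h>

/* Write one byte in every page of [start, start + len) so that, under a
 * first-touch placement policy, each page is backed by memory local to the
 * caller. Returns the number of pages touched. */
static size_t first_touch(volatile unsigned char *start, size_t len,
                          size_t page_size)
{
    size_t touched = 0;
    for (size_t off = 0; off < len; off += page_size) {
        start[off] = 0; /* the store forces the page to be instantiated */
        touched++;
    }
    return touched;
}
```

[A process would typically call this once on its own pages during
initialization, passing the page size obtained from
sysconf(_SC_PAGESIZE).]
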
>> I'm not an expert on this stuff, but I just wanted to reassure you that
>> Solaris supports NUMA programming. There are interfaces for discovering the
>> NUMA topology of a machine (there is a hierarchy of "locality groups", each
>> containing CPUs and memory), for discovering in which locality group you
>> are, for advising the VM system where you want memory placed, and for
>> querying where certain memory is. I could do more homework on these matters
>> if it'd be helpful.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>