For the mailing list...
Note that we moved this conversation to a higher-bandwidth medium
(telephone). If others are interested, please let us know.
On Sep 3, 2008, at 1:21 AM, Eugene Loh wrote:
> Jeff Squyres wrote:
>> I think even first-touch will make *the whole page* be local to
>> the process that touches it.
>> So if you have each process take N bytes (where N << page_size),
>> then the 0th process will make that whole page be local; it may be
>> remote for others.
> I think I'm not making myself clear. Read on...
>>> *) You wouldn't need to control memory allocations with a lock
>>> (except for multithreaded apps). I haven't looked at this too
>>> closely yet, but the 3*n*n memory allocations in shared memory
>>> during MPI_Init are currently serialized, which sounds disturbing
>>> when n is 100 to 500 local processes.
>> If I'm understanding your proposal right, you're saying that each
>> process would create its own shared memory space, right? Then any
>> other process that wants to send to that process would
>> mmap/shmattach/whatever to the receiver's shared memory space. Right?
> I don't think it's necessary to have each process have its own
> segment. The OS manages the shared area on a per-page basis
> anyhow. All that's necessary is that there is an agreement up front
> about which pages will be local to which process. E.g., if there
> are P processes/processors and the shared area has M pages per
> process, then there will be P*M pages altogether. We'll say that
> the first M pages are local to process 0, the next M to process 1,
> etc. That is, process 0 will first-touch the first M pages, process
> 1 will first-touch the next M pages, etc. If an allocation needs to
> be local to process i, then process i will allocate it from its
> pages. Since only process i can allocate from these pages, it does
> not need any lock protection to keep other processes from allocating
> at the same time. And, since these pages have the proper locality,
> then small allocations can all share common pages (instead of having
> a separate page for each 12-byte or 64-byte allocation).
> Clearer? One shared memory region, partitioned equally among all
> processes. Each process first-touches its own pages to get the
> right locality. Each allocation is made by the process to which it
> should be local. Benefits include no multi-process locks and no
> need for page alignment of tiny allocations.
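
A minimal sketch of the partitioning scheme described above, not the
actual Open MPI code. All names (shared_region_t, slice_alloc_t,
region_create, slice_init, slice_alloc) are hypothetical, and the sketch
assumes the region is mapped once (for example before fork) so that
every process sees the same pages. Each process first-touches and
allocates only from its own M-page slice, so the bump pointer can live
in private memory and needs no lock.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on some libcs */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for one shared region of P*M pages. */
typedef struct {
    char  *base;         /* start of the whole shared region   */
    size_t slice_bytes;  /* bytes owned by each process (M pages) */
    int    nprocs;       /* P                                    */
} shared_region_t;

/* Per-process allocator state. It is private to the owning process,
 * which is why no inter-process lock is needed. */
typedef struct {
    char *next;
    char *end;
} slice_alloc_t;

/* Map one anonymous shared region. Returns 0 on success, -1 on error. */
static int region_create(shared_region_t *r, int nprocs, size_t pages_per_proc)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || nprocs <= 0 || pages_per_proc == 0)
        return -1;
    r->slice_bytes = pages_per_proc * (size_t)page;
    r->nprocs = nprocs;
    void *p = mmap(NULL, r->slice_bytes * (size_t)nprocs,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    return 0;
}

/* Called by process `rank` only: first-touch its own slice so that a
 * first-touch placement policy puts those pages near this process. */
static void slice_init(const shared_region_t *r, int rank, slice_alloc_t *a)
{
    char *start = r->base + (size_t)rank * r->slice_bytes;
    memset(start, 0, r->slice_bytes);
    a->next = start;
    a->end  = start + r->slice_bytes;
}

/* Bump-allocate nbytes with the given power-of-two alignment (e.g. a
 * cacheline). Small allocations share pages; returns NULL when full. */
static void *slice_alloc(slice_alloc_t *a, size_t nbytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t p   = ((uintptr_t)a->next + (align - 1)) & ~(uintptr_t)(align - 1);
    uintptr_t end = (uintptr_t)a->end;
    if (p > end || nbytes > end - p)
        return NULL;
    a->next = (char *)p + nbytes;
    return (void *)p;
}
```

With this layout, process i would call slice_init(&r, i, &a) once and
then slice_alloc(&a, 12, 64) or slice_alloc(&a, 64, 64) for each small
object; consecutive objects land 64 bytes apart on the same page rather
than on separate pages.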
>> The total amount of shared memory will likely not go down, because
>> the OS will still likely allocate on a per-page basis, right?
> Total amount would go down significantly. Today, if you want to
> allocate 64 bytes on a page boundary, you allocate 64+pagesize,
> roughly a 100x overhead. What I'm (evidently not so clearly)
> proposing is that we establish a policy about what memory will be
> local to whom. With that policy, we simply allocate our 64 bytes in
> the appropriate region. This eliminates the need for page alignment
> (the page is already in the right place, shared by many allocations,
> all of which want to be there). You could still want cacheline
> alignment... that's fine.
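
To put numbers on the overhead claim above, here is a small sketch of
the arithmetic. The helper names are hypothetical, and the page sizes
(4096 and 8192 bytes) and the 64-byte cacheline are assumptions; the
thread does not state them.

```c
#include <stddef.h>

/* Bytes consumed by one allocation that is padded so it can start on
 * a page boundary: size + page size, as described in the message. */
static size_t padded_bytes(size_t size, size_t page_size)
{
    return size + page_size;
}

/* Bytes consumed by n allocations packed into a shared slice, each
 * rounded up to a cacheline boundary only. */
static size_t packed_bytes(size_t n, size_t size, size_t cacheline)
{
    size_t stride = (size + cacheline - 1) / cacheline * cacheline;
    return n * stride;
}
```

Under those assumptions, a 64-byte allocation costs 4160 bytes (65x)
with 4096-byte pages and 8256 bytes (129x) with 8192-byte pages, which
is the same ballpark as the 100x figure. Packed with cacheline
alignment, one hundred 12-byte allocations take 6400 bytes in total.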
>> But per your 2nd point, would the resources required for each
>> process to mmap/shmattach/whatever 511 other processes' shared
>> memory spaces be prohibitive?
> No need to have more shared memory segments. Just need a policy to
> say how your global space is partitioned.
>>> Graham, Richard L. wrote:
>>>> I have not looked at the code in a long time, so not sure how
>>>> many things have changed ... In general what you are suggesting
>>>> is reasonable. However, especially on large machines you also
>>>> need to worry about memory locality, so should allocate from
>>>> memory pools that are appropriately located. I expect that
>>>> memory allocated on a per-socket basis would do.
>>> Is this what "maffinity" and "memory nodes" are about? If so, I
>>> would think memory locality should be handled there rather than
>>> in page alignment of individual 12-byte and 64-byte allocations.
>> maffinity was a first stab at memory affinity. It is currently
>> (and has been for a long, long time) no-frills, and not a lot of
>> thought was put into it.
>> I see the "node id" and "bind" functions in there; I think Gleb
>> must have added them somewhere along the way. I'm not sure how
>> much thought was put into making those be truly generic functions
>> (I see them implemented in libnuma, which AFAIK is
>> Linux-specific). Does Solaris have memory affinity function calls?
> Yes, I believe so, though perhaps I don't understand your question.
> Things like mbind() and numa_setlocal_memory() are, I assume, Linux
> calls for placing some memory close to a process. I think the
> Solaris madvise() call does this: give a memory range and say
> something about how that memory should be placed -- e.g., the memory
> should be placed local to the next thread to touch that memory.
> Anyhow, I think the default policy is "first touch", so one could
> always do that.
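
A hedged sketch of the "advise, then first-touch" idea mentioned above.
The function name place_local is hypothetical. The sketch assumes that
the Solaris madvise() advice in question is MADV_ACCESS_LWP and uses it
only where the system headers define it; elsewhere it relies on the
default first-touch policy alone.

```c
#define _DEFAULT_SOURCE          /* exposes madvise() on some libcs */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hint that [addr, addr+len) should be placed near the next thread to
 * touch it (where such a hint exists), then touch it from the caller.
 * Returns the madvise() result, or 0 when no hint is available. */
static int place_local(void *addr, size_t len)
{
    int rc = 0;
#ifdef MADV_ACCESS_LWP
    /* Assumed Solaris advice value; skipped on other systems. */
    rc = madvise(addr, len, MADV_ACCESS_LWP);
#endif
    /* First touch: under a first-touch policy the pages are placed
     * local to the calling thread when they are first written. */
    memset(addr, 0, len);
    return rc;
}
```

On Linux, the analogous explicit calls would be the mbind() and
numa_setlocal_memory() interfaces named above; they are left out here
so the sketch does not depend on libnuma.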
> I'm not an expert on this stuff, but I just wanted to reassure you
> that Solaris supports NUMA programming. There are interfaces for
> discovering the NUMA topology of a machine (there is a hierarchy of
> "locality groups", each containing CPUs and memory), for discovering
> in which locality group you are, for advising the VM system where
> you want memory placed, and for querying where certain memory is. I
> could do more homework on these matters if it'd be helpful.
> devel mailing list