On Nov 14, 2008, at 10:56 AM, Eugene Loh wrote:
>> I too am interested - I think we need to do something about the sm
>> backing file situation as larger core machines are slated to
>> become more prevalent shortly.
> I think there is at least one piece of low-hanging fruit: get rid of
> a lot of the page alignments. Especially as one goes to large core
> counts, the O(n^2) number of local "connections" becomes important,
> and each connection starts with three page-aligned allocations, each
> allocation very tiny (and hence using only a tiny portion of the page
> that is allocated to it). So, most of the allocated memory is
> never used.
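To put rough numbers on that (assuming 4 KB pages and counting one connection per ordered pair of local processes): with 16 processes on a node that's 16*15 = 240 connections, each consuming 3 pages, or about 2.8 MB of backing file for structures that are only a few bytes each; at 64 processes it grows to 64*63*3 pages, roughly 47 MB, almost all of it never touched.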
> Personally, I question the rationale for the page alignment in the
> first place, but don't mind listening to anyone who wants to explain
> it to me. Presumably, in a NUMA machine, localizing FIFOs to
> separate physical memory improves performance. I get that basic
> premise. I just question the reasoning beyond that.
I think the original rationale was that only pages could be physically
pinned (not cache lines).
A slight modification of Eugene's low-hanging fruit might be to figure
out which processes are local to each other (e.g., processes on the
cores of one socket, where memory is local to all of those cores).
Those processes' data could be laid out contiguously (perhaps even
within a single page, depending on how many cores there are) instead
of on individual pages. Specifically: use page alignment only once per
group of processes that share the same memory locality.
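As a very rough sketch of that idea (none of this is actual Open MPI code; the struct and function names are made up, and I'm assuming a 64-byte cache line and 4 KB pages), the per-group layout might look like:

    #include <stddef.h>
    #include <stdint.h>

    #define CACHELINE 64       /* assumed cache-line size */
    #define PAGE_SIZE 4096     /* assumed page size       */

    /* Hypothetical per-connection FIFO control block (heads, tails, etc.). */
    struct fifo_ctrl {
        volatile uint32_t head;
        volatile uint32_t tail;
    };

    /* Round x up to a multiple of align (align must be a power of two). */
    static size_t round_up(size_t x, size_t align)
    {
        return (x + align - 1) & ~(align - 1);
    }

    /* One page-aligned slab per locality group: each connection's control
     * block gets its own cache line, but all of them share the same page(s)
     * instead of taking a page apiece. */
    static size_t group_slab_size(int nconnections)
    {
        size_t per_conn = round_up(sizeof(struct fifo_ctrl), CACHELINE);
        return round_up((size_t) nconnections * per_conn, PAGE_SIZE);
    }

    /* Byte offset of connection i's control block within its group's slab. */
    static size_t conn_offset(int i)
    {
        return (size_t) i * round_up(sizeof(struct fifo_ctrl), CACHELINE);
    }

With 64-byte blocks, the 8*7 = 56 connections among the 8 cores of a socket would still fit in a single 4 KB page.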
> The page alignment appears in ompi_fifo_init and ompi_cb_fifo_init;
> it also comes from mca_mpool_sm_alloc. Four minor changes would
> switch the alignment from page size to cache-line size.
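I haven't looked at those four spots, but conceptually only the alignment argument changes; a standalone sketch (using posix_memalign as a stand-in for the shared-memory pool allocator, and assuming a 64-byte cache line) would be something like:

    #include <stdlib.h>
    #include <unistd.h>

    #define CACHELINE_SIZE 64   /* assumed cache-line size */

    /* Allocate one small FIFO control structure.  posix_memalign stands in
     * for the real shared-memory pool allocator here.  Page alignment is
     * the current behavior; cache-line alignment is the proposed change,
     * which lets many tiny allocations pack into one page while still
     * avoiding false sharing between sender and receiver. */
    void *alloc_fifo_ctrl(size_t size, int use_cacheline)
    {
        void *p = NULL;
        size_t align = use_cacheline ? CACHELINE_SIZE
                                     : (size_t) getpagesize();
        if (posix_memalign(&p, align, size) != 0) {
            return NULL;
        }
        return p;
    }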
>> what happens when there isn't enough memory to support all this?
>> Are we smart enough to detect this situation? Does the sm
>> subsystem quietly shut down? Warn and shut down? Segfault?
> I'm not exactly sure. I think it's a combination of three things:
> *) some attempt to signal problems correctly
> *) some degree just to live with less shared memory (possibly
> leading to performance degradation)
> *) poorly tested in any case
>> I have two examples so far:
>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>> node, 2ppn, with btl=openib,sm,self. The program started, but
>> segfaulted on the first MPI_Send. No warnings were printed.
>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>> (unverified - some uncertainty; it could have been much larger).
>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>> The program ran to completion without errors or warnings. I don't
>> know the communication pattern - it could be that no local
>> communication was performed, though that seems doubtful.