
Subject: Re: [OMPI devel] SM backing file size
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2008-11-14 10:56:05


Ralph Castain wrote:

> I too am interested - I think we need to do something about the sm
> backing file situation as larger core machines are slated to become
> more prevalent shortly.

I think there is at least one piece of low-hanging fruit: get rid of a
lot of the page alignments. Especially as one goes to large core
counts, the O(n^2) number of local "connections" becomes important, and
each connection starts with three page-aligned allocations, each
allocation very tiny (and hence using only a tiny portion of the page
that is allocated to it). So, most of the allocated memory is never used.
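
To put very rough numbers on that (a back-of-envelope estimate only:
4 KB pages and the three-allocations-per-connection layout described
above are assumptions, not figures pulled from the code):

    /* Illustrative estimate only: assumes 4 KB pages and three
     * page-aligned allocations per one-way local connection.
     * Not derived from the Open MPI source. */
    #include <stdio.h>

    int main(void)
    {
        const long page_size = 4096;    /* assumed page size */
        const long allocs_per_conn = 3; /* per-connection allocations */

        for (long n = 2; n <= 256; n *= 2) {
            long conns = n * (n - 1);   /* O(n^2) local connections */
            double mb = conns * allocs_per_conn * page_size
                        / (1024.0 * 1024.0);
            printf("%4ld local procs: ~%8.1f MB of page-aligned FIFO space\n",
                   n, mb);
        }
        return 0;
    }

With those assumptions, 16 local processes already burn a few MB of
mostly-unused pages, and hundreds of cores would burn hundreds of MB.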

Personally, I question the rationale for the page alignment in the first
place, but don't mind listening to anyone who wants to explain it to
me. Presumably, in a NUMA machine, localizing FIFOs to separate
physical memory improves performance. I get that basic premise. I just
question the reasoning beyond that.

The page alignment appears in ompi_fifo_init and ompi_cb_fifo_init, and
additional alignment comes from mca_mpool_sm_alloc. Four minor changes
would be enough to reduce the alignment from page size to cache-line size.
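
Just to illustrate the shape of the change (a sketch only;
sm_alloc_aligned and the size constants below are hypothetical
stand-ins, not the actual mca_mpool_sm_alloc interface):

    /* Sketch only -- hypothetical wrapper, not the real allocator API. */
    #include <stddef.h>

    #define ASSUMED_PAGE_SIZE      4096  /* typical, platform-dependent */
    #define ASSUMED_CACHELINE_SIZE  128  /* generous; 64 on most x86    */

    void *sm_alloc_aligned(size_t size, size_t alignment); /* hypothetical */

    void *alloc_fifo_block(size_t size)
    {
        /* before: page alignment wastes most of a page per tiny FIFO */
        /* return sm_alloc_aligned(size, ASSUMED_PAGE_SIZE); */

        /* after: cache-line alignment still keeps each FIFO on its own
         * cache line (no false sharing) at a fraction of the footprint */
        return sm_alloc_aligned(size, ASSUMED_CACHELINE_SIZE);
    }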

> what happens when there isn't enough memory to support all this? Are
> we smart enough to detect this situation? Does the sm subsystem
> quietly shut down? Warn and shut down? Segfault?

I'm not exactly sure. I think it's a combination of three things:

*) some attempt to signal problems correctly (a rough sketch of the
kind of check involved follows this list)
*) some willingness to just live with less shared memory (possibly
leading to performance degradation)
*) in any case, these paths are poorly tested
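
For what it's worth, here is the kind of pre-flight check I have in
mind (hypothetical only; this is not what the sm component does today,
and check_backing_space is a made-up name):

    /* Hypothetical check, not current Open MPI behavior: before
     * mmap()ing the sm backing file, compare the requested size
     * against the free space on the filesystem holding it. */
    #include <stdio.h>
    #include <stddef.h>
    #include <sys/statvfs.h>

    int check_backing_space(const char *dir, size_t needed)
    {
        struct statvfs vfs;
        if (statvfs(dir, &vfs) != 0)
            return -1;                 /* can't tell; caller decides */

        unsigned long long avail =
            (unsigned long long)vfs.f_bavail * vfs.f_frsize;
        if (avail < needed) {
            fprintf(stderr,
                    "sm backing file needs %zu bytes but %s has only "
                    "%llu free\n", needed, dir, avail);
            return -1;                 /* warn and fall back */
        }
        return 0;
    }

A caller could then warn and shrink the backing file (or disable sm
entirely) instead of failing later with no warning.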

>
> I have two examples so far:
>
> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
> node, 2ppn, with btl=openib,sm,self. The program started, but
> segfaulted on the first MPI_Send. No warnings were printed.
>
> 2. again with a ramdisk, /tmp was reportedly set to 16MB (unverified
> - some uncertainty, it could have been much larger). OMPI was run on
> multiple nodes, 16ppn, with btl=openib,sm,self. The program ran to
> completion without errors or warnings. I don't know the communication
> pattern - it could be that no local comm was performed, though that
> sounds doubtful.