George Bosilca wrote:
> Then it looks like the safest solution is to use either ftruncate or
> the lseek method and then touch the first byte of all memory pages.
> Unfortunately, I see two problems with this. First, there is a clear
> performance hit on the startup time. And second, we will have to find
> a pretty smart way to do this or we will completely break the memory
> affinity stuff.
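For concreteness, here is a minimal sketch of the "ftruncate, then touch the first byte of every page" approach described above. It is not the actual sm BTL/mpool code: the helper name map_and_touch and its pages_touched out-parameter are made up for illustration, and error handling is kept to the bare minimum.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): create a backing file of
 * `size` bytes, map it shared, and write one byte into every page so that
 * each page is actually faulted in rather than merely reserved.
 * Returns the mapping, or NULL on failure. */
void *map_and_touch(const char *path, size_t size, size_t *pages_touched)
{
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) {
        return NULL;
    }

    /* Set the file length up front. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        unlink(path);
        return NULL;
    }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) {
        unlink(path);
        return NULL;
    }

    /* Touch the first byte of each page. This loop is the start-up cost. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = base;
    size_t n = 0;
    for (size_t off = 0; off < size; off += page) {
        p[off] = 0;
        n++;
    }

    if (pages_touched != NULL) {
        *pages_touched = n;
    }
    return base;
}
```

In this sketch the process that runs the loop is the one that faults every page in, which is where the affinity worry comes from.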
We're basically touching all the pages on start-up anyhow.
Let me explain.
The sm BTL needs to set up a shared (mmap'ed) file large enough for both
what's needed at MPI_Init time and whatever room you'll want for growth
during the course of the run. We used to size this file "arbitrarily"
(mpool_sm_per_peer_size and mpool_sm_[min|max]_size), which allocated
shared memory excessively for small jobs but insufficiently (won't start
up) for big jobs. As part of moving to the single-queue model, I tried
to size the shared memory more reasonably -- at a minimum, so that jobs
would start up. The current formula is to estimate how much memory will
be needed at MPI_Init time and size the file to match. We can argue
about whether or not headroom should be included, but currently (1.3.2)
none is really provided.
So, the shared area is basically filled up during MPI_Init(). For large
np, most of that space is eager fragments. An eager fragment in the
shared area includes a pointer back to the free list that manages that
fragment. Those pointers have to be initialized. Since eager fragments
by default are 4K, it turns out that basically every page is touched
during MPI_Init(). (Fine print: not true of the max fragments, but
there aren't very many of those.)
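To illustrate the point, here is a rough sketch (not the real free-list code) of why initializing one back-pointer per fragment ends up touching every page. The struct layout, the function name, and the fixed 4K page size are all assumptions for illustration; the area is assumed to be page-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096  /* assumed page size, for illustration only */

struct free_list;  /* opaque stand-in for the list that manages a fragment */

/* Hypothetical fragment header: just the pointer back to the free list. */
struct frag_header {
    struct free_list *owner;
};

/* Carve `area` into fragments of `frag_size` bytes, initialize each header,
 * and return how many distinct pages were written in the process. */
size_t init_frags_count_pages(unsigned char *area, size_t area_size,
                              size_t frag_size, struct free_list *fl)
{
    if (frag_size < sizeof(struct frag_header)) {
        return 0;
    }

    size_t npages = (area_size + PAGE_SIZE - 1) / PAGE_SIZE;
    unsigned char *touched = calloc(npages, 1);
    if (touched == NULL) {
        return 0;
    }

    size_t count = 0;
    for (size_t off = 0; off + frag_size <= area_size; off += frag_size) {
        /* Writing the back-pointer dirties the page holding the header. */
        struct frag_header *h = (struct frag_header *)(area + off);
        h->owner = fl;

        size_t pg = off / PAGE_SIZE;
        if (!touched[pg]) {
            touched[pg] = 1;
            count++;
        }
    }

    free(touched);
    return count;
}
```

With 4K fragments over a 16-page area, this writes to all 16 pages. With a larger fragment size (say 32K, chosen arbitrarily here), only the page holding each header is written, which is the "fine print" case.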