Open MPI Development Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-12-07 10:40:51

On Dec 7, 2005, at 9:44 AM, Gleb Natapov wrote:

> On Tue, Dec 06, 2005 at 11:07:44AM -0500, Brian Barrett wrote:
>> On Dec 6, 2005, at 10:53 AM, Gleb Natapov wrote:
>>> On Tue, Dec 06, 2005 at 08:33:32AM -0700, Tim S. Woodall wrote:
>>>>> Also memfree hooks decrease cache efficiency, the better solution
>>>>> would be to catch brk() system calls and remove memory from cache
>>>>> only then, but there is no way to do it for now.
>>>> We are looking at other options, including catching brk/munmap
>>>> system calls, and will be experimenting w/ these on the trunk.
>>> This will be really interesting. How are you going to catch
>>> brk/munmap without kernel help? Last time I checked preload tricks
>>> don't work if syscall is done from inside libc itself.
>> All of the tricks we are looking at assume that nothing in libc calls
>> munmap.
> glibc does call mmap/munmap internally for big allocations as strace
> of this program shows:
> #include <stdlib.h>
>
> int main ()
> {
>     void *p = malloc (1024*1024);
>     free (p);
>     return 0;
> }

Ah, yes, I wasn't clear. On Linux, we actually ship our own version
of ptmalloc2 (the allocator used by glibc on Linux). We use the
standard linker search order tricks to have the linker choose our
versions of malloc, calloc, realloc, valloc, and free, which are from
ptmalloc2. We've modified our version of ptmalloc2 such that any
time it calls mmap or sbrk with a positive increment, it immediately
tells the cache about the allocation. Any time it's about to call
munmap or sbrk with a negative increment, it informs the cache code
before giving the memory back to the OS. We also
catch mmap and munmap so that we can track when the user calls mmap /
munmap. Note that we play with ptmalloc2's code such that it calls
our mmap (which either uses the syscall interface directly or calls
__mmap depending on what the system supports), so we don't intercept
that call to mmap twice or anything like that.
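
A minimal C sketch of the notify-after-mmap / notify-before-munmap ordering described above. This is an illustration only, not the Open MPI or ptmalloc2 code: the names tracked_mmap, tracked_munmap, cache_note_alloc, and cache_note_release are hypothetical, and the "cache" here is just a byte counter standing in for real registration bookkeeping.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the cache: counts bytes obtained from the OS. */
static size_t bytes_tracked = 0;

/* Called right after memory has been obtained from the OS. */
static void cache_note_alloc(void *addr, size_t len)
{
    (void)addr;
    bytes_tracked += len;
}

/* Called just before memory is handed back to the OS, so the cache can
 * drop anything it holds for [addr, addr + len) while it is still mapped. */
static void cache_note_release(void *addr, size_t len)
{
    (void)addr;
    assert(bytes_tracked >= len);
    bytes_tracked -= len;
}

/* Allocator-side wrapper: map first, then tell the cache. */
static void *tracked_mmap(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
        cache_note_alloc(p, len);
    }
    return p;
}

/* Allocator-side wrapper: tell the cache first, then unmap. */
static int tracked_munmap(void *addr, size_t len)
{
    cache_note_release(addr, len);
    return munmap(addr, len);
}
```

The ordering is the point of the sketch: the release notification has to happen before munmap, since afterwards the address range may be reused by another mapping.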

This works pretty well (like I said, it's worked fine for LAM and
MPICH-gm for years), but has the problem of requiring the user to
either use the wrapper compilers or add -lmpi -lorte -lopal to the
link line (i.e., they can't rely on shared library dependencies to
load them in), or our ptmalloc2 / mmap / munmap isn't used. We can
detect that this happened pretty easily, and then we fall back to the
pipelined RDMA code, which doesn't offer the same performance but also
doesn't have a pinning problem.
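
One way such a detect-and-fall-back check could look, as a rough C sketch. The message above does not say how the detection is actually done, so the probe approach here is an assumption, and every name (hook_fired, hooked_free, choose_protocol, the rdma_protocol enum) is hypothetical. The idea: run a probe allocation through whatever free function got linked in and see whether our hook observed it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Set by the (hypothetical) interposed allocator whenever it runs. */
static bool hook_fired = false;

typedef enum {
    PROTO_CACHED_RDMA,     /* relies on the hooks to keep the cache valid */
    PROTO_PIPELINED_RDMA   /* slower, but has no pinning problem */
} rdma_protocol;

/* Stand-in for "our" free(): records that it was called, then frees. */
static void hooked_free(void *p)
{
    hook_fired = true;
    free(p);
}

/* Probe the free function that ended up being used. If our hook never
 * fires, our allocator was not linked in, so pick the safe protocol. */
static rdma_protocol choose_protocol(void (*free_fn)(void *))
{
    assert(free_fn != NULL);
    hook_fired = false;
    void *probe = malloc(1);
    free_fn(probe);        /* free(NULL) is harmless if malloc failed */
    return hook_fired ? PROTO_CACHED_RDMA : PROTO_PIPELINED_RDMA;
}
```

Calling choose_protocol(hooked_free) selects the cached path, while choose_protocol(free) models the case where the system allocator won the link and selects the pipelined path.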

>> We can successfully catch free() calls from inside libc
>> without any problems. The LAM/MPI team and Myricom (with MPICH-gm)
>> have been doing this for many years without any problems. On the
>> small percentage of MPI applications that require some linker tricks
>> (some of the commercial apps are this way), we won't be able to
>> intercept any free/munmap calls, so we're going to fall back to our
>> RDMA pipeline algorithm.
> Yes, but catching free is not good enough. This way we sometimes evict
> cache entries that may safely remain in the cache. Ideally we should
> be able to catch events that return memory to OS (munmap/brk) and
> remove the memory from cache only then.

This is essentially what we do on Linux - we only tell the rcache
code about allocations / deallocations when we are talking about
getting memory from or giving memory back to the operating system.
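
To make the "evict only when memory goes back to the OS" idea concrete, here is a small C sketch of a cache that drops entries overlapping a released address range. It is a toy under stated assumptions, not the rcache implementation: the fixed-size table and the names reg_entry, cache_insert, cache_release_range, and cache_count are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

/* One cached registration covering [base, base + len). */
typedef struct {
    uintptr_t base;
    size_t    len;
    bool      valid;
} reg_entry;

static reg_entry entries[MAX_ENTRIES];

/* Remember a registered region; returns false if the table is full. */
static bool cache_insert(uintptr_t base, size_t len)
{
    assert(len > 0);
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (!entries[i].valid) {
            entries[i] = (reg_entry){ base, len, true };
            return true;
        }
    }
    return false;
}

/* Called when [base, base + len) is about to be returned to the OS.
 * Evicts every entry that overlaps the range; returns how many. */
static size_t cache_release_range(uintptr_t base, size_t len)
{
    size_t evicted = 0;
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (entries[i].valid &&
            entries[i].base < base + len &&
            base < entries[i].base + entries[i].len) {
            entries[i].valid = false;
            evicted++;
        }
    }
    return evicted;
}

/* Number of entries still cached. */
static size_t cache_count(void)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (entries[i].valid) n++;
    }
    return n;
}
```

A plain free() that only returns a chunk to the allocator's own free lists would never reach cache_release_range in this scheme, so entries for memory that is still mapped stay cached.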

On Mac OS X / Darwin, due to their two-level namespaces, we can't
replace malloc / free with a customized version of the Darwin
allocator like we could with ptmalloc2. There are some things you
can do to simulate such behavior, but it requires linking in a flat
namespace and doing some other things that nearly caused the Darwin
engineers to pass out when I was talking to them about said tricks.
So instead, we use the Darwin hooks for catching malloc / free /
etc. It's not optimal, but it's the best we can do in the
situation. And it doesn't force us to link all OMPI applications in
a flat namespace, which is always nice. Of course, we still
intercept mmap / munmap in the traditional linker tricks style. But
again, there are very few function calls in libSystem.dylib that call
mmap that we care about (malloc / free are already taken care of by
the standard hooks), so this doesn't cause a problem.

Hopefully this made some sense. If not, on to the next round of
e-mails :).


   Brian Barrett
   Open MPI developer