
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c
From: Christopher Samuel (samuel_at_[hidden])
Date: 2013-08-29 19:33:39


Hi Jeff, Ralph,

On 29/08/13 23:30, Jeff Squyres (jsquyres) wrote:

> Let me try to understand this test:
> - you're simulating a 1GB memory limit via ulimit of virtual
> memory ("ulimit -v $((1*1024*1024))"), i.e. 1,048,576 KB (ulimit -v takes kilobytes).

Yeah, basically doing by hand what Torque/Slurm do by default for jobs
(unless the user asks for more).
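Concretely, the by-hand reproduction presumably looked something like the following (the `ulimit` invocation is quoted from Jeff's mail; the binary name is hypothetical):

```shell
# Cap virtual memory at 1 GiB for this shell and its children,
# the same kind of cap Torque/Slurm impose on a job by default.
# Note: ulimit -v takes kilobytes, so this is 1,048,576 KB = 1 GiB.
ulimit -v $((1*1024*1024))
ulimit -v   # confirm: prints 1048576
# mpirun -np 4 ./dalton_test   # hypothetical app run under the limit
```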

When this happens to Dalton (compiled with the Intel compilers), it
just sits there spinning its wheels at startup.

> - you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI
> app

That was the developer trying to simulate the failure in Dalton.

> - OMPI is barfing in the ptmalloc allocator

Sounds like it.

> Meaning: you're trying to allocate 1,000x more memory than you're
> allowing in virtual memory -- so I guess part of this test depends
> on how much physical RAM you have, because you're limiting virtual
> memory, right?

No, it only depends on the memory limits for the job in Slurm.

The reason for the test is that he was trying to see whether those
limits were successfully propagated to the MPI ranks under Slurm (and
it appears they are not).

However, in the process he found he could also replicate this
livelock/deadlock in Dalton.

> It's quite possible that the ptmalloc included in OMPI doesn't
> guard well against a failed mmap. FWIW, I've seen all kinds of
> random badness (not just with OMPI) when malloc/mmap/etc. start
> failing due to lack of memory.

OK, so I'll try testing again with a larger limit to see if that will
ameliorate this issue. I'm also wondering where this is happening in
OMPI; I've a sneaking suspicion it's at MPI_INIT().

> Do you get the same behavior if you disable ptmalloc in OMPI?
> (your IB large message bandwidth will suffer a bit, though)

I haven't tried that, but I'll take a look at it if it doesn't seem
possible to fix with a change to the default memory limits (that would
be the least intrusive option).
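For completeness, the two usual knobs for this in the 1.x series (hedging here -- check the FAQ for your exact version) are disabling the memory manager at configure time, or turning off the Linux ptmalloc component at run time:

```shell
# Option 1: build Open MPI without the internal ptmalloc memory manager
./configure --without-memory-manager ...

# Option 2: disable the Linux memory hooks at run time; this must be
# set in the environment, not passed via --mca on the mpirun line
export OMPI_MCA_memory_linux_disable=1
mpirun -np 4 ./app
```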

--
 Christopher Samuel Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
