Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c
From: Christopher Samuel (samuel_at_[hidden])
Date: 2013-08-29 19:33:39


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jeff, Ralph,

On 29/08/13 23:30, Jeff Squyres (jsquyres) wrote:

> Let me try to understand this test:
>
> - you're simulating a 1GB memory limit via ulimit of virtual
> memory ("ulimit -v $((1*1024*1024))"), or 1,048,576 bytes.

Yeah, basically doing by hand what Torque/Slurm do by default for jobs
(unless the user asks for more).

When Dalton (compiled with the Intel compilers) hits this limit it
just sits there spinning its wheels at startup.

> - you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI
> app

That was the developer trying to simulate the failure in Dalton.
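
A note on the units, since the two numbers are easy to misread: bash's
`ulimit -v` takes its argument in units of 1024 bytes, so a value of
1*1024*1024 corresponds to 1 GiB rather than 1,048,576 bytes. The short
sketch below only redoes that arithmetic; the helper name
`ulimit_v_to_bytes` is made up for illustration and is not part of the
developer's test.

```python
KIB = 1024  # bash's `ulimit -v` counts in units of 1024 bytes


def ulimit_v_to_bytes(ulimit_value_kib: int) -> int:
    """Convert a `ulimit -v` argument (in KiB) to bytes."""
    return ulimit_value_kib * KIB


# Value passed to `ulimit -v` in the test: $((1*1024*1024))
limit_bytes = ulimit_v_to_bytes(1 * 1024 * 1024)

# Size of the allocation the developer attempts: 1070*10^6 bytes
request_bytes = 1070 * 10**6

# Address space left over for everything else in the process
headroom_bytes = limit_bytes - request_bytes

print(limit_bytes)     # 1073741824
print(request_bytes)   # 1070000000
print(headroom_bytes)  # 3741824
```

On that reading, the single allocation sits just under the limit and
leaves under 4 MB of address space for the rest of the process.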

> - OMPI is barfing in the ptmalloc allocator

Sounds like it.

> Meaning: you're trying to allocate 1,000x memory than you're
> allowing in virtual memory -- so I guess part of this test depends
> on how much physical RAM you have, because you're limiting virtual
> memory, right?

No, it only depends on the memory limits for the job in Slurm.

The reason for the test is that he was trying to see whether those
limits were being successfully propagated to MPI ranks in Slurm (and
it appears they are not).
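
One way to check propagation directly, independent of any allocation
test, is to have each rank report the address-space limit it actually
sees. The following is a minimal sketch, not the developer's test: it
assumes a Linux node with Python available, and the function names are
hypothetical. It reads the rank from `SLURM_PROCID` (set by srun) or
`OMPI_COMM_WORLD_RANK` (set by Open MPI's mpirun) when either is
present.

```python
import os
import resource
import socket


def format_limit(value: int) -> str:
    """Render an rlimit value as a readable string."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes ({value / 2**30:.2f} GiB)"


def describe_address_space_limit() -> str:
    """Report the RLIMIT_AS (virtual memory) limit seen by this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Rank is taken from the launcher's environment if available.
    rank = (
        os.environ.get("SLURM_PROCID")
        or os.environ.get("OMPI_COMM_WORLD_RANK")
        or "?"
    )
    return (
        f"host={socket.gethostname()} rank={rank} "
        f"soft={format_limit(soft)} hard={format_limit(hard)}"
    )


print(describe_address_space_limit())
```

Running this once in the batch script itself and once per rank under
the launcher, then comparing the lines, should show whether the ranks
inherit the job's limit or come up as "unlimited".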

However, in the process he found he could also replicate this
livelock/deadlock in Dalton.

> It's quite possible that the ptmalloc included in OMPI doesn't
> guard well against a failed mmap. FWIW, I've seen all kinds of
> random badness (not just with OMPI) when malloc/mmap/etc. start
> failing due to lack of memory.

OK, so I'll try testing again with a larger limit to see if that will
ameliorate this issue. I'm also wondering where in OMPI this is
happening; I've a sneaking suspicion it's at MPI_INIT().

> Do you get the same behavior if you disable ptmalloc in OMPI?
> (your IB large message bandwidth will suffer a bit, though)

I've not tried that, but I'll take a look at it if it doesn't seem
possible to fix this with a change to the default memory limits (that
would be the least intrusive option).

Thanks!
Chris
- --
 Christopher Samuel Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/ http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlIf2lMACgkQO2KABBYQAh/JrACfRKATdmD3hbSX0mHWtAt2cBP6
1wYAn31EjuS37inIaD151n1DxuAH4GAM
=yaYe
-----END PGP SIGNATURE-----