-----BEGIN PGP SIGNED MESSAGE-----
Hi Jeff, Ralph,
On 29/08/13 23:30, Jeff Squyres (jsquyres) wrote:
> Let me try to understand this test:
> - you're simulating a 1GB memory limit via ulimit of virtual
> memory ("ulimit -v $((1*1024*1024))"), or 1,048,576 bytes.
Yeah, basically doing by hand what Torque/Slurm do by default for jobs
(unless the user asks for more).
When Dalton (compiled with the Intel compilers) hits this limit it
just sits there spinning its wheels at startup.
> - you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI
That was the developer trying to simulate the failure in Dalton.
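As a rough sketch of the arithmetic behind the two numbers above: bash's `ulimit -v` takes its argument in units of 1024 bytes, so the limit here works out to 1 GiB rather than the 1,048,576 bytes given in the quoted text. The variable names below are illustrative only, and the sketch assumes the limit was set from bash exactly as quoted.

```python
# Assumption: the limit was set with bash's `ulimit -v`, whose argument is
# counted in 1024-byte units (KiB), not bytes.
KIB = 1024

limit_kib = 1 * 1024 * 1024        # the value passed to `ulimit -v`
limit_bytes = limit_kib * KIB      # address-space limit in bytes (1 GiB)
request_bytes = 1070 * 10**6       # the single allocation made by the test

# Address space left over for everything else in the process.
headroom_bytes = limit_bytes - request_bytes

print(f"limit:    {limit_bytes:,} bytes")
print(f"request:  {request_bytes:,} bytes")
print(f"headroom: {headroom_bytes:,} bytes ({headroom_bytes / KIB**2:.2f} MiB)")
```

The request is therefore only just under the limit, not 1,000x over it: it leaves less than 4 MiB of address space for the executable, shared libraries and stack, which is very likely why the allocation fails.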
> - OMPI is barfing in the ptmalloc allocator
Sounds like it.
> Meaning: you're trying to allocate 1,000x more memory than you're
> allowing in virtual memory -- so I guess part of this test depends
> on how much physical RAM you have, because you're limiting virtual
> memory, right?
No, it only depends on the memory limits for the job in Slurm.
The reason for the test is that he was trying to see whether those
limits were being propagated to the MPI ranks in Slurm (and it appears
they are not).
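One way to check the propagation directly, sketched here under assumptions rather than taken from the developer's actual test, is to have every rank report its own address-space limit. The helper names are hypothetical. The sketch assumes the launcher exports `SLURM_PROCID` (srun) or `OMPI_COMM_WORLD_RANK` (Open MPI's mpirun), and it reads the limit with Python's standard `resource` module instead of going through MPI.

```python
import os
import resource


def format_limit(value):
    """Render an RLIMIT_AS value (reported in bytes) as KiB or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def describe_address_space_limit(environ=os.environ):
    """Return one line describing this process's RLIMIT_AS soft/hard limits."""
    # Assumption: the launcher sets one of these rank variables; "?" is
    # printed when neither is present (e.g. when run outside a job).
    rank = environ.get("SLURM_PROCID", environ.get("OMPI_COMM_WORLD_RANK", "?"))
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return (f"rank {rank} on {os.uname().nodename}: "
            f"soft={format_limit(soft)} hard={format_limit(hard)}")


if __name__ == "__main__":
    print(describe_address_space_limit())
```

Launched once per rank inside a job, the KiB figures can be compared against what `ulimit -v` prints in the batch script itself. A rank reporting "unlimited" would indicate the limit did not reach it.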
However, in the process he found he could also replicate this
livelock/deadlock in Dalton.
> It's quite possible that the ptmalloc included in OMPI doesn't
> guard well against a failed mmap. FWIW, I've seen all kinds of
> random badness (not just with OMPI) when malloc/mmap/etc. start
> failing due to lack of memory.
OK, so I'll try testing again with a larger limit to see if that will
ameliorate this issue. I'm also wondering where in OMPI this is
happening; I've a sneaking suspicion it's at MPI_INIT().
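On the failed-mmap point above, the following is a minimal Python stand-in for what "guarding against a failed mmap" means. It is not what ptmalloc or Open MPI actually do. It assumes a Linux system where `RLIMIT_AS` is enforced, and both helper names are hypothetical. The limit is lowered only for the duration of the `with` block, and a refused mapping is reported and handled instead of being used blindly.

```python
import contextlib
import mmap
import resource


@contextlib.contextmanager
def address_space_limit(limit_bytes):
    """Temporarily lower this process's soft RLIMIT_AS to at most limit_bytes."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Never go above limits that are already in force.
    finite = [v for v in (soft, hard) if v != resource.RLIM_INFINITY]
    new_soft = min([limit_bytes] + finite)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    try:
        yield
    finally:
        # Restore the original soft limit (the hard limit was never changed).
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))


def try_anonymous_map(nbytes):
    """Return an anonymous mapping of nbytes, or None if the kernel refuses."""
    try:
        return mmap.mmap(-1, nbytes)
    except OSError as exc:
        # The guard: the failure is noticed and reported to the caller
        # rather than the code carrying on with an invalid mapping.
        print(f"mmap of {nbytes:,} bytes failed: {exc}")
        return None


if __name__ == "__main__":
    # Same figures as the test in this thread: 1 GiB limit, 1070*10^6 bytes.
    with address_space_limit(1 * 1024**3):
        region = try_anonymous_map(1070 * 10**6)
    print("mapping succeeded" if region is not None else "mapping refused")
```

The design choice worth noting is that the caller gets an explicit `None` and decides what to do next; an allocator that skips this check is the kind of code that can end up spinning or crashing somewhere unrelated.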
> Do you get the same behavior if you disable ptmalloc in OMPI?
> (your IB large message bandwidth will suffer a bit, though)
Not tried that, but I'll take a look at it if it doesn't seem possible
to fix it with a change to the default memory limits (that'll be the
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
-----END PGP SIGNATURE-----