Subject: Re: [OMPI devel] [OMPI users] Memory manager
From: Terry Frankcombe (terry_at_[hidden])
Date: 2007-11-27 18:13:29

Hi Jeff

> > I posted this to the devel list the other day, but it raised no
> > responses. Maybe people will have more to say here.
> Sorry Terry; many of us were at the SC conference last week, and this
> week is short because of the US holiday. Some of the inbox got
> dropped/delayed as a result...

'Tis OK. Beggars can't be choosers! ;-)


> > Because of this I can't reduce the problem to a small testcase, and so
> > have not included any code at this stage.
> Ugh. Heisenbugs are the worst.
> Have you tried with a memory checking debugger, such as valgrind, or a
> parallel debugger? Is there a chance that there's a simple errant
> posted receive (perhaps in a race condition) that is unexpectedly
> receiving into the Bug's memory location when you don't expect it?

I have zero experience with valgrind. But I downloaded it and ran my
"minimal" case (about 1000 lines + libraries!) with it. Thus I found
one uninitialised variable and need to go away and check my code
carefully now. Correcting this in the most obvious, un-thought-through
way makes my Bug go away. (But then so does changing the code in other,
unexecuted sections!)
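
For illustration only: this is not the code in question, and the helper name any_above is made up. The class of bug that Memcheck reports as "Conditional jump or move depends on uninitialised value(s)" can be sketched in C roughly as below. The sketch shows the corrected form, and the comments mark where the initialiser would have been missing.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if any element of v exceeds limit, else 0. */
int any_above(const double *v, size_t n, double limit)
{
    /* Without the "= 0" initialiser, found would be indeterminate whenever
     * no element exceeds limit (or n is 0). */
    int found = 0;

    for (size_t i = 0; i < n; ++i) {
        if (v[i] > limit) {
            found = 1;
        }
    }

    /* A caller that branches on this result is where Memcheck would flag
     * the use of an uninitialised value in the buggy variant. */
    return found;
}
```

Because reading an indeterminate value is undefined behaviour, the symptom can appear or vanish when unrelated code changes, which would be consistent with a bug that moves around as other sections are edited.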

However, what I get out of valgrind now is:

[tjf@fkpc167 Minimal]$ valgrind --leak-check=yes ./nnh
==20671== Memcheck, a memory error detector.
==20671== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==20671== Using LibVEX rev 1732, a library for dynamic binary translation.
==20671== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==20671== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==20671== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==20671== For more details, rerun with: -v
==20671== Conditional jump or move depends on uninitialised value(s)
==20671== at 0x40152B1: (within /lib/
==20671== by 0x4005278: (within /lib/
==20671== by 0x4007CFD: (within /lib/
==20671== by 0x400318A: (within /lib/
==20671== by 0x4013D9A: (within /lib/
==20671== by 0x40012C6: (within /lib/
==20671== by 0x4000A67: (within /lib/


==20671== Conditional jump or move depends on uninitialised value(s)
==20671== at 0x40152B1: (within /lib/
==20671== by 0x400A289: (within /lib/
==20671== by 0x6A42E4D: (within /lib/
==20671== by 0x59AE0E3: (within /lib/
==20671== by 0x400D725: (within /lib/
==20671== by 0x59AE4EC: (within /lib/
==20671== by 0x59AE099: dlsym (in /lib/
==20671== by 0x57610FB: vm_sym (in /usr/local/lib/
==20671== by 0x575E29E: lt_dlsym (in /usr/local/lib/
==20671== by 0x57666EF: open_component (in /usr/local/lib/
==20671== by 0x576711B: mca_base_component_find (in /usr/local/lib/
==20671== by 0x5767A9F: mca_base_components_open (in /usr/local/lib/


<my code output, no valgrind errors within it>

==20671== ERROR SUMMARY: 102 errors from 24 contexts (suppressed: 0 from
==20671== malloc/free: in use at exit: 0 bytes in 0 blocks.
==20671== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==20671== For counts of detected errors, rerun with: -v
==20671== All heap blocks were freed -- no leaks are possible.

This looks particularly broken!

I've just run valgrind on another (serial) piece of code on this machine
and got three of the uninitialised jumps from within, virtually
identical to the first three from this MPI code. Of the 24 from the MPI
code, those seeming to originate from within OpenMPI are particularly

Am I panicking for no reason, have I likely got a bad build, or is
OpenMPI broken beyond repair?!!

> > If I run the code with mpirun -np 1 the problem goes away. So one could
> > presumably simply say "always run it with mpirun." But if this is
> > required, why does OpenMPI not detect it?
> I'm not sure what you're asking -- Open MPI does not *require* you to
> run with mpirun...

That's exactly what I was asking. Cheers!


Dr Terry Frankcombe
Physical Chemistry, Department of Chemistry
Göteborgs Universitet
SE-412 96 Göteborg Sweden
Ph: +46 76 224 0887   Skype: terry.frankcombe