Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI users] Memory manager
From: Terry Frankcombe (terry_at_[hidden])
Date: 2007-11-27 18:13:29


Hi Jeff

> > I posted this to the devel list the other day, but it raised no
> > responses. Maybe people will have more to say here.
>
> Sorry Terry; many of us were at the SC conference last week, and this
> week is short because of the US holiday. Some of the inbox got
> dropped/delayed as a result...

'Tis OK. Beggars can't be choosers! ;-)

<snip>

> > Because of this I can't reduce the problem to a small testcase, and so
> > have not included any code at this stage.
>
> Ugh. Heisenbugs are the worst.
>
> Have you tried with a memory checking debugger, such as valgrind, or a
> parallel debugger? Is there a chance that there's a simple errant
> posted receive (perhaps in a race condition) that is unexpectedly
> receiving into the Bug's memory location when you don't expect it?
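
(For anyone reading this in the archives: as I understand Jeff's scenario, it is something like the sketch below -- illustrative C only, not lifted from my code -- where a nonblocking receive is posted into a buffer that no longer exists by the time the matching message turns up.)

#include <mpi.h>

/* Purely illustrative: the receive below is posted into a stack
 * variable that stops existing when the function returns, but the
 * receive itself stays posted.  When the matching send finally
 * arrives, MPI writes the message into whatever now occupies that
 * address -- silent corruption of "unrelated" data.                 */
static void post_and_forget(MPI_Request *req)
{
    int scratch;                                   /* stack buffer   */
    MPI_Irecv(&scratch, 1, MPI_INT, MPI_ANY_SOURCE, 42,
              MPI_COMM_WORLD, req);
    /* bug: return without MPI_Wait/MPI_Cancel; 'scratch' is gone    */
}

int main(int argc, char **argv)
{
    int rank, payload = 123;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        post_and_forget(&req);      /* rank 0 never waits on req     */
    else if (rank == 1)
        MPI_Send(&payload, 1, MPI_INT, 0, 42, MPI_COMM_WORLD);

    /* ... later work on rank 0 reuses that stack memory and gets
     * scribbled on when the message lands ...                        */

    MPI_Finalize();
    return 0;
}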

I have zero experience with valgrind, but I downloaded it and ran my
"minimal" case (about 1000 lines + libraries!) through it. That turned
up one uninitialised variable, so I need to go away and check my code
carefully now. Correcting this in the most obvious, un-thought-through
way makes my Bug go away. (But then, so does changing the code in other,
unexecuted sections!)
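
(If I've understood valgrind's complaint correctly, that also explains why
edits to unexecuted code can hide the symptom: an uninitialised read just
picks up whatever bytes happen to be at that address, and any change to
the layout or call sequence changes those bytes. A contrived C
illustration, nothing to do with my actual code:)

#include <stdio.h>

/* Purely illustrative: 'flag' is never assigned, so the branch depends
 * on whatever bytes already sit at that stack address.  Those bytes
 * depend on everything that ran before, which is why editing apparently
 * unrelated code can make the symptom appear or vanish.  valgrind
 * reports this as "Conditional jump or move depends on uninitialised
 * value(s)".                                                          */
static int decide(void)
{
    int flag;               /* oops: never initialised                */
    if (flag)
        return 1;
    return 0;
}

int main(void)
{
    printf("decision: %d\n", decide());
    return 0;
}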

However, what I get out of valgrind now is:

[tjf_at_fkpc167 Minimal]$ valgrind --leak-check=yes ./nnh
==20671== Memcheck, a memory error detector.
==20671== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==20671== Using LibVEX rev 1732, a library for dynamic binary translation.
==20671== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==20671== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==20671== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==20671== For more details, rerun with: -v
==20671==
==20671== Conditional jump or move depends on uninitialised value(s)
==20671== at 0x40152B1: (within /lib/ld-2.5.so)
==20671== by 0x4005278: (within /lib/ld-2.5.so)
==20671== by 0x4007CFD: (within /lib/ld-2.5.so)
==20671== by 0x400318A: (within /lib/ld-2.5.so)
==20671== by 0x4013D9A: (within /lib/ld-2.5.so)
==20671== by 0x40012C6: (within /lib/ld-2.5.so)
==20671== by 0x4000A67: (within /lib/ld-2.5.so)

...<snip>...

==20671== Conditional jump or move depends on uninitialised value(s)
==20671== at 0x40152B1: (within /lib/ld-2.5.so)
==20671== by 0x400A289: (within /lib/ld-2.5.so)
==20671== by 0x6A42E4D: (within /lib/libc-2.5.so)
==20671== by 0x59AE0E3: (within /lib/libdl-2.5.so)
==20671== by 0x400D725: (within /lib/ld-2.5.so)
==20671== by 0x59AE4EC: (within /lib/libdl-2.5.so)
==20671== by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
==20671== by 0x57610FB: vm_sym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671== by 0x575E29E: lt_dlsym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671== by 0x57666EF: open_component (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671== by 0x576711B: mca_base_component_find (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671== by 0x5767A9F: mca_base_components_open (in /usr/local/lib/libopen-pal.so.0.0.0)

...<snip>...

<my code output, no valgrind errors within it>

==20671==
==20671== ERROR SUMMARY: 102 errors from 24 contexts (suppressed: 0 from 0)
==20671== malloc/free: in use at exit: 0 bytes in 0 blocks.
==20671== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==20671== For counts of detected errors, rerun with: -v
==20671== All heap blocks were freed -- no leaks are possible.

This looks particularly broken!

I've just run valgrind on another (serial) piece of code on this machine
and got three of the uninitialised jumps from within ld-2.5.so, virtually
identical to the first three from this MPI code. Of the 24 contexts from
the MPI code, those that seem to originate from within Open MPI are
particularly worrying.
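
(If these do turn out to be harmless -- the ld-2.5.so ones look like the
usual dynamic-loader noise, and the dlsym/lt_dlsym trail is presumably
just Open MPI loading its MCA components -- then rather than wading
through them every run I gather one can hide known-benign reports with a
valgrind suppressions file. Something along these lines, where the file
name is just whatever one chooses to call it:)

valgrind --gen-suppressions=all --leak-check=yes ./nnh     (prints ready-made suppression stanzas to copy into a file)
valgrind --suppressions=ompi-ld.supp --leak-check=yes ./nnh   (later runs with the known noise hidden)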

Am I panicking for no reason, have I likely got a bad build, or is
Open MPI broken beyond repair?!!

> > If I run the code with mpirun -np 1 the problem goes away. So one
> > could presumably simply say "always run it with mpirun." But if this
> > is required, why does OpenMPI not detect it?
>
> I'm not sure what you're asking -- Open MPI does not *require* you to
> run with mpirun...

That's exactly what I was asking. Cheers!
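
(For the archives, the two launch styles I was conflating -- ./nnh here
standing in for any MPI executable:)

mpirun -np 4 ./nnh      (explicit launch: four processes in MPI_COMM_WORLD)
./nnh                   (direct "singleton" launch: MPI_Init gives a one-process MPI_COMM_WORLD)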

Ciao
Terry

-- 
Dr Terry Frankcombe
Physical Chemistry, Department of Chemistry
Göteborgs Universitet
SE-412 96 Göteborg Sweden
Ph: +46 76 224 0887   Skype: terry.frankcombe
<terry_at_[hidden]>