Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI users] Memory manager
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-26 17:59:54


On Nov 20, 2007, at 6:52 AM, Terry Frankcombe wrote:

> I posted this to the devel list the other day, but it raised no
> responses. Maybe people will have more to say here.

Sorry Terry; many of us were at the SC conference last week, and this
week is short because of the US holiday. Some of the inbox got
dropped/delayed as a result...

(case in point: this mail sat unfinished on my laptop until I returned
from the holiday today -- sorry!)

> Questions: How much does using the MPI wrappers influence the memory
> management at runtime?

I'm not sure what you mean here, but it's not really the MPI wrappers
that are at issue. Rather, it's whether support for the memory
manager was compiled into the Open MPI libraries or not. For example
(and I just double checked this to be sure) -- I compiled OMPI with
and without the memory manager on RHEL4U4 and the output from
"mpicc --showme" is exactly the same.

> What has changed in this regard from 1.2.3 to 1.2.4?

Nothing, AFAIK...? I don't see anything in NEWS w.r.t. the memory
manager stuff for v1.2.4.

> The reason I ask is that I have an f90 code that does very strange
> things. The structure of the code is not all that straightforward,
> with a "tree" of modules usually allocating their own storage (all
> with save applied globally within the module). Compiling with OpenMPI
> 1.2.4 coupled to a gcc 4.3.0 prerelease and running as a single
> process (with no explicit mpirun), the elements of one particular
> array seem to revert to previous values between where they are set
> and a later part of the code. (I'll refer to this as The Bug, and
> having the matrix elements stay as set as "expected behaviour".)

Yoinks. :-(

> The most obvious explanation would be a coding error. However,
> compiling and running this with OpenMPI 1.2.3 gives me the expected
> behaviour! As does compiling and running with a different MPI
> implementation and compiler set. Replacing the prerelease gcc 4.3.0
> with the released 4.2.2 makes no change.
>
> The Bug is unstable. Removing calls to various routines in used
> modules (that otherwise do not affect the results) returns to
> expected behaviour at runtime. Removing a call to MPI_Recv that is
> never called returns to expected behaviour.
>
> Because of this I can't reduce the problem to a small testcase, and so
> have not included any code at this stage.

Ugh. Heisenbugs are the worst.

Have you tried a memory-checking debugger, such as valgrind, or a
parallel debugger? Is there a chance that there's a simple errant
posted receive (perhaps in a race condition) that is receiving into
the Bug's memory location when you don't expect it?
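
To make the errant-receive idea concrete, here is a toy model in plain Python. It involves no MPI at all: PendingReceive is a hypothetical stand-in for a receive that was posted and never waited on, and the array and its values are made up. It only sketches how a late completion could make elements appear to revert to earlier values; it is not a diagnosis of the actual code.

```python
# Toy model only: no MPI involved. PendingReceive is a hypothetical stand-in
# for a nonblocking receive that was posted and then forgotten.
from array import array


class PendingReceive:
    """Holds a writable view of the destination buffer until data arrives."""

    def __init__(self, buffer):
        self.view = memoryview(buffer)

    def complete(self, payload):
        # The "message" lands in whatever memory the receive was given.
        self.view[:len(payload)] = payload


matrix = array("d", [1.0, 2.0, 3.0, 4.0])

# An errant receive is posted into the first two elements and never waited on.
stale = PendingReceive(memoryview(matrix)[:2])

# The program sets new values and expects them to stay put.
matrix[0] = 10.0
matrix[1] = 20.0

# Later the late message completes and overwrites them with old-looking data.
stale.complete(array("d", [1.0, 2.0]))

print(list(matrix))  # [1.0, 2.0, 3.0, 4.0]
```

In a real program the completion time depends on scheduling and message arrival, which is one way such a problem could come and go as unrelated code is added or removed.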

> If I run the code with mpirun -np 1 the problem goes away. So one
> could presumably simply say "always run it with mpirun." But if this
> is required, why does OpenMPI not detect it?

I'm not sure what you're asking -- Open MPI does not *require* you to
run with mpirun. Indeed, the memory management code in Open MPI works
the same whether or not you use mpirun. If you run without mpirun,
you'll get an MPI_COMM_WORLD size of 1 (known as a "singleton" MPI
job).

> And why the difference
> between 1.2.3 and 1.2.4?

There are lots of differences between 1.2.3 and 1.2.4 -- see:

     https://svn.open-mpi.org/trac/ompi/browser/branches/v1.2/NEWS

As for what exactly would cause it to exhibit the Bug behavior in
1.2.4 and not in 1.2.3 -- I don't know. As I said above, Heisenbugs
are the worst -- changing one thing makes it [seem to] go away, etc.
It could be that the Bug still exists but simply is not visible
when you use 1.2.3. Buffer overflows can be like that, for example --
if you overflow into an area of memory that doesn't matter, then
you'll never notice the bug. But if you move some data around, now
perhaps that same buffer overflow will overwrite some critical memory
and you *will* notice the Bug.
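
The buffer overflow scenario can be sketched with a small simulation. This is a toy model in Python, with a bytearray standing in for process memory; the sizes, offsets, and the run() helper are all hypothetical and chosen only for illustration. The same buggy write is harmless in one layout and destructive in the other.

```python
# Toy model: a bytearray stands in for process memory. All names are
# hypothetical; nothing here is Open MPI code.
BUF_SIZE = 8      # bytes the buffer is supposed to hold
WRITE_SIZE = 12   # bytes the buggy code actually writes (4 too many)


def run(critical_offset):
    """Return the critical bytes as they look after the buggy write."""
    memory = bytearray(32)
    # Place 4 bytes of "critical" data at the chosen offset.
    memory[critical_offset:critical_offset + 4] = b"\x2a" * 4
    # Buggy write: starts at offset 0 and overruns the 8-byte buffer.
    memory[0:WRITE_SIZE] = b"\xff" * WRITE_SIZE
    return bytes(memory[critical_offset:critical_offset + 4])


# Layout A: unused padding follows the buffer, so the overflow goes unnoticed.
print(run(critical_offset=16) == b"\x2a" * 4)   # True
# Layout B: the critical data sits directly after the buffer and is clobbered.
print(run(critical_offset=8) == b"\x2a" * 4)    # False
```

The bug is identical in both runs; only the position of the data that matters has changed, which is the kind of shift a different library version or a removed call can cause.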

-- 
Jeff Squyres
Cisco Systems