I fixed the problem we were experiencing by adding a barrier.
The bug occurred between a piece of code that uses many SEND calls (from the leader, in a loop)
and RECV calls (in the worker processes) to ship data from the head / leader to the
processing nodes, and the allreduce that follows it. I think what might have been happening is
that, without a barrier, this communication was getting mixed up with the following allreduce.
The bug shows up in Valgrind and dmalloc as a read from freed memory.
I might spend some time trying to make a small piece of code that reproduces this,
but maybe this gives you some idea of what might be the issue,
if it's something that should be fixed.
Some more info: it happens with versions as far back as Open MPI 1.3.4, and still happens with the newest, 1.6.3.
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff Squyres
Sent: Saturday, December 15, 2012 7:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] Possible memory error
On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:
> I'm trying to track down an instance of Open MPI writing to a freed block of memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64-bit Intel architecture, Fedora 14.
> It occurs with a very simple reduction (allreduce minimum), over a single int value.
Can you send a reproducer program? The simpler, the better.
> I'm wondering if the openMPI developers use power tools such as
> valgrind / dmalloc / etc on the releases to try to catch these things
> via exhaustive testing - but I understand memory problems in C are of
> the nature that anyone making a mistake can propagate, so I haven't ruled out problems in our own code.
> Also, I'm wondering if anyone has suggestions on how to track this down further.
Yes, we do use such tools.
Can you cite the specific file/line where the problem is occurring? The allreduce algorithms are fairly self-contained; it should be (relatively) straightforward to examine that code and see if there's a problem with the memory allocation there.
> I'm using Allinea DDT and their built-in dmalloc, which catches the
> error, which appears in the second memcpy in opal_convertor_pack(), but I don't have more details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven't seen anything in earlier parts of the code which
> might have triggered memory corruption, although both Open MPI and Intel IPP do things with uninitialized values before this (according to Valgrind).
There are a number of issues that can lead to false positives for using uninitialized values. Here are two of the most common cases:
1. When using TCP, one of our data headers has a padding hole in it, but we write the whole struct down a TCP socket file descriptor anyway. Hence, it will generate a "read from uninit" warning.
2. When using OpenFabrics-based networks, tools like Valgrind don't see the OS-bypass initialization of the memory (which frequently comes directly from the hardware), and this generates a lot of false "read from uninit" positives.
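To illustrate the first case, here is a small sketch of how a padding hole ends up uninitialized. The struct below is a made-up example (demo_hdr and its helper functions are hypothetical names, not Open MPI's actual header), and the exact amount of padding depends on the platform ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header, NOT Open MPI's real one. On common ABIs the
 * compiler inserts padding after `type` so `length` is 4-byte aligned. */
struct demo_hdr {
    uint8_t  type;
    uint32_t length;
};

/* Number of padding bytes between `type` and `length`. */
size_t demo_hdr_hole(void)
{
    return offsetof(struct demo_hdr, length) - sizeof(uint8_t);
}

/* Field-by-field fill: the padding bytes are never written, so writing
 * the whole struct to a socket makes a checker report an uninit read. */
void demo_hdr_fill(struct demo_hdr *h, uint8_t type, uint32_t length)
{
    assert(h != NULL);
    h->type = type;
    h->length = length;
}

/* Zero the whole struct first; in practice this leaves the padding
 * bytes defined as far as a memory checker is concerned. */
void demo_hdr_fill_zeroed(struct demo_hdr *h, uint8_t type, uint32_t length)
{
    assert(h != NULL);
    memset(h, 0, sizeof *h);
    h->type = type;
    h->length = length;
}
```

Passing a struct filled by demo_hdr_fill to write() with sizeof(struct demo_hdr) as the length is the kind of call that triggers the warning, even though the receiver never looks at the padding.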
One thing you can try is to build Open MPI with the --with-valgrind configure option. This adds a small performance penalty, but we take extra steps to eliminate most false positives. It could help separate the wheat from the chaff in your case.