On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:
> I'm trying to track down an instance of Open MPI writing to a freed block of memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64-bit Intel architecture, Fedora 14.
> It occurs with a very simple reduction (allreduce minimum) over a single int value.
Can you send a reproducer program? The simpler, the better.
> I'm wondering if the Open MPI developers use power tools such as valgrind / dmalloc / etc.
> on the releases to try to catch these things via exhaustive testing,
> but I understand that memory problems in C are such that a mistake made anywhere can propagate,
> so I haven't ruled out problems in our own code.
> Also, I'm wondering if anyone has suggestions on how to track this down further.
Yes, we do use such tools.
Can you cite the specific file/line where the problem is occurring? The allreduce algorithms are fairly self-contained; it should be (relatively) straightforward to examine that code and see if there's a problem with the memory allocation there.
> I'm using Allinea DDT and their built-in dmalloc, which catches the error, which appears in
> the second memcpy in opal_convertor_pack(), but I don't have more details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven't seen anything in earlier parts of the code which might have triggered memory corruption,
> although both Open MPI and Intel IPP do things with uninitialized values before this (according to Valgrind).
There are a number of issues that can lead to false positives for using uninitialized values. Here are two of the most common cases:
1. When using TCP, one of our data headers has a padding hole in it, but we write the whole struct down a TCP socket file descriptor anyway. Hence, it will generate a "read from uninit" warning.
2. When using OpenFabrics-based networks, tools like valgrind don't see the OS-bypass initialization of the memory (which frequently comes directly from the hardware), and they generate a lot of false "read from uninit" positives.
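To illustrate case 1, here is a minimal sketch in C. The struct demo_hdr and its helper functions are hypothetical names invented for this example; they are not Open MPI's actual header layout or API. The sketch only shows how a padding hole arises and why setting fields one by one leaves those bytes uninitialized, whereas zeroing the struct first does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header for illustration only; NOT Open MPI's real layout. */
struct demo_hdr {
    uint8_t  type;    /* 1 byte */
    /* most ABIs insert padding here so that 'length' is aligned */
    uint32_t length;  /* 4 bytes */
};

/* Number of padding bytes the compiler placed between the two fields. */
static size_t demo_hdr_padding(void)
{
    assert(offsetof(struct demo_hdr, length) >= sizeof(uint8_t));
    return offsetof(struct demo_hdr, length) - sizeof(uint8_t);
}

/* Sets only the named fields. Any padding bytes keep whatever was in
 * memory, so writing sizeof(struct demo_hdr) bytes to a socket would
 * make a tool like Valgrind report a read of uninitialized bytes. */
static void demo_hdr_init_fields(struct demo_hdr *h, uint8_t type,
                                 uint32_t length)
{
    h->type = type;
    h->length = length;
}

/* Zeroes the whole object first, then sets the fields. In practice this
 * leaves the padding bytes defined, which silences the warning. */
static void demo_hdr_init_zeroed(struct demo_hdr *h, uint8_t type,
                                 uint32_t length)
{
    memset(h, 0, sizeof(*h));
    h->type = type;
    h->length = length;
}
```

On common 64-bit ABIs demo_hdr_padding() is typically nonzero. A header filled with demo_hdr_init_fields() and then written whole is the pattern that produces the harmless warning; the zeroed variant avoids it at the cost of an extra memset.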
One thing you can try is to build Open MPI with the --with-valgrind configure option. This adds a small performance penalty, but we take extra steps to eliminate most false positives. It could help separate the wheat from the chaff in your case.