
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Possible memory error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-12-15 07:34:23


On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I’m trying to track down an instance of Open MPI writing to a freed block of memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64-bit Intel architecture, Fedora 14.
> It occurs with a very simple reduction (allreduce minimum) over a single int value.

Can you send a reproducer program? The simpler, the better.

> I’m wondering if the Open MPI developers use power tools such as Valgrind / dmalloc / etc.
> on the releases to try to catch these things via exhaustive testing,
> but I understand memory problems in C are of the nature that anyone's mistake can propagate,
> so I haven’t ruled out problems in our own code.
> Also, I’m wondering if anyone has suggestions on how to track this down further.

Yes, we do use such tools.

Can you cite the specific file/line where the problem is occurring? The allreduce algorithms are fairly self-contained; it should be (relatively) straightforward to examine that code and see if there's a problem with the memory allocation there.

> I’m using Allinea DDT and their built-in dmalloc, which catches the error. It appears in
> the second memcpy in opal_convertor_pack(), but I don’t have more details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven’t seen anything in earlier parts of the code which might have triggered memory corruption,
> although both Open MPI and Intel IPP do things with uninitialized values before this (according to Valgrind).

There are a number of issues that can lead to false positives for using uninitialized values. Here are two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we write the whole struct down a TCP socket file descriptor anyway. Hence, tools like Valgrind will report a "read from uninit" warning.

2. When using OpenFabrics-based networks, tools like Valgrind don't see the OS-bypass initialization of the memory (which frequently comes directly from the hardware), so they generate a lot of false "read from uninit" positives.
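
To make the first case concrete, here is a minimal C sketch of how a padding hole arises. The struct and function names (demo_hdr, demo_hdr_fill, and so on) are made up for illustration; this is not Open MPI's actual header layout. On typical platforms the compiler inserts padding after the one-byte field so that the 32-bit field is aligned. Filling the fields one by one never touches those padding bytes, so writing the whole struct to a socket hands uninitialized bytes to the kernel, which is what a memory checker flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header, for illustration only (NOT Open MPI's real layout). */
struct demo_hdr {
    uint8_t  type;    /* 1 byte */
    /* the compiler typically inserts padding here to align 'length' */
    uint32_t length;  /* 4 bytes */
};

/* Number of padding bytes between 'type' and 'length' on this platform. */
size_t demo_hdr_padding(void)
{
    return offsetof(struct demo_hdr, length)
         - (offsetof(struct demo_hdr, type) + sizeof(uint8_t));
}

/* Field-by-field fill: the padding bytes keep whatever was in memory,
 * so sending sizeof(struct demo_hdr) bytes reads uninitialized data. */
void demo_hdr_fill(struct demo_hdr *h, uint8_t type, uint32_t length)
{
    h->type = type;
    h->length = length;
}

/* Common workaround: zero the whole struct before setting the fields,
 * so the padding bytes have been written at least once. */
void demo_hdr_fill_zeroed(struct demo_hdr *h, uint8_t type, uint32_t length)
{
    memset(h, 0, sizeof(*h));
    h->type = type;
    h->length = length;
}
```

Calling demo_hdr_padding() shows how many bytes of the header are never set by demo_hdr_fill() on a given platform. The data that matters (type and length) is identical in both variants, which is why a warning of this kind is harmless noise rather than a real bug.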

One thing you can try is to configure Open MPI with the --with-valgrind option. This adds a small performance penalty, but we take extra steps to eliminate most false positives. It could help separate the wheat from the chaff in your case.

-- 
Jeff Squyres
jsquyres_at_[hidden]