
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Possible memory error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-12-20 08:01:55


On Dec 19, 2012, at 11:26 AM, Handerson, Steven wrote:

> I fixed the problem we were experiencing by adding a barrier.
> The bug occurred between a piece of code that uses many SENDs in a loop (from the leader)
> and RECVs (in the worker processes) to ship data from the head/leader to the
> processing nodes, and the allreduce that follows it. I think what might have been happening is
> that, without a barrier, this point-to-point traffic was getting mixed up with the allreduce.
>
> The bug shows up in Valgrind and dmalloc as a read from freed memory.

Hmm. This sounds sketchy (meaning: it *sounds* like this is a valid communication pattern, but it's impossible to tell without seeing the code).

> I might spend some time trying to make a small piece of code that reproduces this,

If you have the time, that would be great.

> but maybe this gives you some idea of what might be the issue,
> if it's something that should be fixed.
> Some more info: it happens as far back as Open MPI 1.3.4, and still in the newest release, 1.6.3.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/