I applied the fixes for this particular deadlock to the 1.3 code base
late last week; see ticket #1725.
This should fix the described problem, but I have not personally tested
whether the deadlock in question is now gone. Everyone should give
thanks to George for his efforts in tracking down the problem
and finding a solution.
-- Tim Mattox, the v1.3 gatekeeper
On Mon, Jan 12, 2009 at 12:46 PM, Justin <luitjens_at_[hidden]> wrote:
> Hi, has this deadlock been fixed in the 1.3 source yet?
> Jeff Squyres wrote:
>> On Dec 11, 2008, at 5:30 PM, Justin wrote:
>>> The more I look at this bug, the more I'm convinced it is in Open MPI
>>> and not in our code. Here is why: our code generates a
>>> communication/execution schedule. At each timestep this schedule is
>>> executed and all communication and execution is performed. Our problem
>>> uses AMR (adaptive mesh refinement), which means the communication
>>> schedule may change from time to time. In this case the schedule has not
>>> changed in many timesteps, meaning the same communication schedule has
>>> been used for the last X timesteps (X being around 20 in this case).
>>> Our code does involve a very large amount of communication. I have been
>>> able to reduce the hang down to 16 processors, and it seems to me the
>>> hang occurs when we have lots of work per processor: adding more
>>> processors may avoid the hang, while reducing the processor count makes
>>> a hang more likely. (A minimal sketch of this communication pattern
>>> follows after this message.)
>>> What is the status on the fix for this particular freelist deadlock?
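[Editor's note: for readers unfamiliar with the pattern Justin describes,
here is a minimal sketch in C/MPI. This is NOT the Uintah application's
actual code; the message counts, sizes, tags, and loop structure are
illustrative assumptions. It only reproduces the shape of the traffic --
a fixed schedule of many outstanding nonblocking sends and receives,
re-executed unchanged every timestep -- not the bug itself.]

    /* Hypothetical sketch of a fixed communication schedule executed
     * each timestep.  All counts and sizes are made-up values. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NSTEPS        20     /* schedule reused ~20 timesteps  */
    #define MSGS_PER_PEER 64     /* many outstanding msgs per peer */
    #define MSG_SIZE      4096   /* bytes per message              */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int npeers = size - 1;
        int nreqs  = 2 * npeers * MSGS_PER_PEER;
        MPI_Request *reqs = malloc(nreqs * sizeof(MPI_Request));
        char *sendbuf = malloc(MSG_SIZE);
        char *recvbuf = malloc((size_t)npeers * MSGS_PER_PEER * MSG_SIZE);

        for (int step = 0; step < NSTEPS; step++) {
            int r = 0, p = 0;
            /* Execute the same schedule every step: post every receive
             * and send up front, then wait for the whole batch. */
            for (int peer = 0; peer < size; peer++) {
                if (peer == rank) continue;
                for (int m = 0; m < MSGS_PER_PEER; m++) {
                    char *dst = recvbuf +
                        ((size_t)p * MSGS_PER_PEER + m) * MSG_SIZE;
                    MPI_Irecv(dst, MSG_SIZE, MPI_BYTE, peer, m,
                              MPI_COMM_WORLD, &reqs[r++]);
                    MPI_Isend(sendbuf, MSG_SIZE, MPI_BYTE, peer, m,
                              MPI_COMM_WORLD, &reqs[r++]);
                }
                p++;
            }
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
            if (rank == 0) printf("step %d done\n", step);
        }

        free(reqs); free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }

The freelist deadlock discussed in this thread was tied to Open MPI's
internal free lists under exactly this kind of many-outstanding-requests
load; per-rank request counts grow with both peers and messages per peer,
which matches the report that more work per processor made hangs likelier.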
>> George is actively working on it because it is the "last" issue blocking
>> us from releasing v1.3. I fear that if he doesn't get it fixed by tonight,
>> we'll have to push v1.3 to next year (see
>> http://www.open-mpi.org/community/lists/devel/2008/12/5029.php).
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmattox_at_[hidden] || timattox_at_[hidden]
I'm a bright... http://www.the-brights.net/