Hi, has this deadlock been fixed in the 1.3 source yet?
Jeff Squyres wrote:
> On Dec 11, 2008, at 5:30 PM, Justin wrote:
>> The more I look at this bug, the more I'm convinced it is in Open MPI
>> and not in our code. Here is why: Our code generates a
>> communication/execution schedule. At each timestep this schedule is
>> executed and all communication and execution is performed. Our
>> problem uses AMR, which means the communication schedule may change
>> from time to time. In this case the schedule has not changed in many
>> timesteps, meaning the same communication schedule has been used for
>> the last X (X being around 20 in this case) timesteps.
>> Our code does perform a very large amount of communication. I have
>> been able to reduce the hang down to 16 processors, and it seems to
>> me the hang occurs when we have lots of work per processor: if I add
>> more processors it may not hang, but reducing the processor count
>> makes a hang more likely.
>> What is the status on the fix for this particular freelist deadlock?
> George is actively working on it because it is the "last" issue
> blocking us from releasing v1.3. I fear that if he doesn't get it
> fixed by tonight, we'll have to push v1.3 to next year (see
> http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and