On Dec 11, 2008, at 5:30 PM, Justin wrote:
> The more I look at this bug, the more I'm convinced it is in
> Open MPI and not our code. Here is why: our code generates a
> communication/execution schedule. At each timestep this schedule is
> executed and all communication and execution is performed. Our
> problem is AMR, which means the communication schedule may change
> from time to time. In this case the schedule has not changed in
> many timesteps, meaning the same communication schedule has been used
> for the last X (X being around 20 in this case) timesteps.
> Our code does perform a very large amount of communication. I have
> been able to reduce the hang down to 16 processors, and it seems to
> me the hang occurs when we have lots of work per processor: if I
> add more processors it may not hang, but reducing processors makes
> it more likely to hang.
> What is the status on the fix for this particular freelist deadlock?
George is actively working on it because it is the "last" issue
blocking us from releasing v1.3. I fear that if he doesn't get it
fixed by tonight, we'll have to push v1.3 to next year (see
http://www.open-mpi.org/community/lists/devel/2008/12/5029.php).