Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Deadlock on large numbers of processors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-12-11 17:44:58


On Dec 11, 2008, at 5:30 PM, Justin wrote:

> The more I look at this bug, the more I'm convinced it is in
> Open MPI and not in our code. Here is why: our code generates a
> communication/execution schedule. At each timestep this schedule is
> executed, and all communication and execution is performed. Our
> problem uses AMR, which means the communication schedule may change
> from time to time. In this case the schedule has not changed in
> many timesteps, meaning the same communication schedule has been used
> for the last X timesteps (X being around 20 in this case).
> Our code does have a very large communication problem. I have been
> able to reduce the hang down to 16 processors, and it seems to me the
> hang occurs when we have lots of work per processor. That is, if I
> add more processors it may not hang, but reducing processors makes it
> more likely to hang.
> What is the status on the fix for this particular freelist deadlock?

George is actively working on it because it is the "last" issue
blocking us from releasing v1.3. I fear that if he doesn't get it
fixed by tonight, we'll have to push v1.3 to next year (see
http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and
http://www.open-mpi.org/community/lists/users/2008/12/7499.php).

-- 
Jeff Squyres
Cisco Systems