Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Deadlock on large numbers of processors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-12 13:41:49


Justin --

Could you actually give your code a whirl with 1.3rc3 to ensure that
it fixes the problem for you?

     http://www.open-mpi.org/software/ompi/v1.3/

On Jan 12, 2009, at 1:30 PM, Tim Mattox wrote:

> Hi Justin,
> I applied the fixes for this particular deadlock to the 1.3 code base
> late last week, see ticket #1725:
> https://svn.open-mpi.org/trac/ompi/ticket/1725
>
> This should fix the described problem, but I personally have not tested
> to see if the deadlock in question is now gone. Everyone should give
> thanks to George for his efforts in tracking down the problem
> and finding a solution.
> -- Tim Mattox, the v1.3 gatekeeper
>
> On Mon, Jan 12, 2009 at 12:46 PM, Justin <luitjens_at_[hidden]> wrote:
>> Hi, has this deadlock been fixed in the 1.3 source yet?
>>
>> Thanks,
>>
>> Justin
>>
>>
>> Jeff Squyres wrote:
>>>
>>> On Dec 11, 2008, at 5:30 PM, Justin wrote:
>>>
>>>> The more I look at this bug, the more I'm convinced it is in Open MPI
>>>> and not our code. Here is why: our code generates a
>>>> communication/execution schedule. At each timestep this schedule is
>>>> executed and all communication and execution are performed. Our problem
>>>> is AMR, which means the communication schedule may change from time to
>>>> time. In this case the schedule has not changed in many timesteps,
>>>> meaning the same communication schedule has been used for the last X
>>>> timesteps (X being around 20 in this case).
>>>> Our code does have a very large communication problem. I have been able
>>>> to reduce the hang down to 16 processors, and it seems to me the hang
>>>> occurs when we have lots of work per processor. That is, if I add more
>>>> processors it may not hang, but reducing processors makes it more
>>>> likely to hang.
>>>> What is the status of the fix for this particular freelist deadlock?
>>>
>>>
>>> George is actively working on it because it is the "last" issue blocking
>>> us from releasing v1.3. I fear that if he doesn't get it fixed by tonight,
>>> we'll have to push v1.3 to next year (see
>>> http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and
>>> http://www.open-mpi.org/community/lists/users/2008/12/7499.php).
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> --
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> tmattox_at_[hidden] || timattox_at_[hidden]
> I'm a bright... http://www.the-brights.net/

-- 
Jeff Squyres
Cisco Systems