
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Deadlock on large numbers of processors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-12 15:19:04


Cross your fingers; we might release tomorrow (I've probably now
jinxed it by saying that!).

On Jan 12, 2009, at 1:54 PM, Justin wrote:

> In order for me to test this out I need to wait for TACC to install
> this version on Ranger. Right now they have version 1.3a1r19685
> installed. I'm guessing this is probably an older version. I'm not
> sure when TACC will get around to updating their Open MPI version. I
> could request them to update it, but it would be a lot easier to
> request an actual release. What is the current schedule for the 1.3
> release?
>
> Justin
>
> Jeff Squyres wrote:
>> Justin --
>>
>> Could you actually give your code a whirl with 1.3rc3 to ensure
>> that it fixes the problem for you?
>>
>> http://www.open-mpi.org/software/ompi/v1.3/
>>
>>
>> On Jan 12, 2009, at 1:30 PM, Tim Mattox wrote:
>>
>>> Hi Justin,
>>> I applied the fixes for this particular deadlock to the 1.3 code
>>> base
>>> late last week, see ticket #1725:
>>> https://svn.open-mpi.org/trac/ompi/ticket/1725
>>>
>>> This should fix the described problem, but I personally have not
>>> tested
>>> to see if the deadlock in question is now gone. Everyone should
>>> give
>>> thanks to George for his efforts in tracking down the problem
>>> and finding a solution.
>>> -- Tim Mattox, the v1.3 gatekeeper
>>>
>>> On Mon, Jan 12, 2009 at 12:46 PM, Justin <luitjens_at_[hidden]>
>>> wrote:
>>>> Hi, has this deadlock been fixed in the 1.3 source yet?
>>>>
>>>> Thanks,
>>>>
>>>> Justin
>>>>
>>>>
>>>> Jeff Squyres wrote:
>>>>>
>>>>> On Dec 11, 2008, at 5:30 PM, Justin wrote:
>>>>>
>>>>>> The more I look at this bug, the more I'm convinced it is with
>>>>>> Open MPI and
>>>>>> not our code. Here is why: Our code generates a communication/
>>>>>> execution
>>>>>> schedule. At each timestep this schedule is executed and all
>>>>>> communication
>>>>>> and execution is performed. Our problem is AMR which means the
>>>>>> communication schedule may change from time to time. In this
>>>>>> case the
>>>>>> schedule has not changed in many timesteps meaning the same
>>>>>> communication
>>>>>> schedule is being used as the last X (x being around 20 in this
>>>>>> case)
>>>>>> timesteps.
>>>>>> Our code does have a very large communication problem. I have
>>>>>> been able
>>>>>> to reduce the hang down to 16 processors and it seems to me the
>>>>>> hang occurs
>>>>>> when we have lots of work per processor. Meaning if I add more
>>>>>> processors
>>>>>> it may not hang but reducing processors makes it more likely to
>>>>>> hang.
>>>>>> What is the status on the fix for this particular freelist
>>>>>> deadlock?
>>>>>
>>>>>
>>>>> George is actively working on it because it is the "last" issue
>>>>> blocking
>>>>> us from releasing v1.3. I fear that if he doesn't get it fixed
>>>>> by tonight,
>>>>> we'll have to push v1.3 to next year (see
>>>>> http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and
>>>>> http://www.open-mpi.org/community/lists/users/2008/12/7499.php).
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>> --
>>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>> tmattox_at_[hidden] || timattox_at_[hidden]
>>> I'm a bright... http://www.the-brights.net/
>>
>>
>

-- 
Jeff Squyres
Cisco Systems