Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Deadlock on large numbers of processors
From: Justin (luitjens_at_[hidden])
Date: 2009-01-12 13:54:20


In order for me to test this out I need to wait for TACC to install this
version on Ranger. Right now they have version 1.3a1r19685 installed,
which I'm guessing is an older version. I'm not sure when TACC will get
around to updating their Open MPI installation. I could ask them to
update it, but it would be a lot easier to request an actual release.
What is the current schedule for the 1.3 release?

Justin

Jeff Squyres wrote:
> Justin --
>
> Could you actually give your code a whirl with 1.3rc3 to ensure that
> it fixes the problem for you?
>
> http://www.open-mpi.org/software/ompi/v1.3/
>
>
> On Jan 12, 2009, at 1:30 PM, Tim Mattox wrote:
>
>> Hi Justin,
>> I applied the fixes for this particular deadlock to the 1.3 code base
>> late last week, see ticket #1725:
>> https://svn.open-mpi.org/trac/ompi/ticket/1725
>>
>> This should fix the described problem, but I personally have not tested
>> to see if the deadlock in question is now gone. Everyone should give
>> thanks to George for his efforts in tracking down the problem
>> and finding a solution.
>> -- Tim Mattox, the v1.3 gatekeeper
>>
>> On Mon, Jan 12, 2009 at 12:46 PM, Justin <luitjens_at_[hidden]> wrote:
>>> Hi, has this deadlock been fixed in the 1.3 source yet?
>>>
>>> Thanks,
>>>
>>> Justin
>>>
>>>
>>> Jeff Squyres wrote:
>>>>
>>>> On Dec 11, 2008, at 5:30 PM, Justin wrote:
>>>>
>>>>> The more I look at this bug, the more I'm convinced it is with
>>>>> Open MPI and not our code. Here is why: our code generates a
>>>>> communication/execution schedule. At each timestep this schedule
>>>>> is executed and all communication and execution is performed. Our
>>>>> problem is AMR, which means the communication schedule may change
>>>>> from time to time. In this case the schedule has not changed in
>>>>> many timesteps, meaning the same communication schedule has been
>>>>> used for the last X (X being around 20 in this case) timesteps.
>>>>> Our code does have a very large communication load. I have been
>>>>> able to reduce the hang down to 16 processors, and it seems to me
>>>>> the hang occurs when we have lots of work per processor: if I add
>>>>> more processors it may not hang, but reducing processors makes it
>>>>> more likely to hang.
>>>>> What is the status on the fix for this particular freelist deadlock?
>>>>
>>>>
>>>> George is actively working on it because it is the "last" issue
>>>> blocking us from releasing v1.3. I fear that if he doesn't get it
>>>> fixed by tonight, we'll have to push v1.3 to next year (see
>>>> http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and
>>>> http://www.open-mpi.org/community/lists/users/2008/12/7499.php).
>>>>
>>>
>>>
>>
>> --
>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>> tmattox_at_[hidden] || timattox_at_[hidden]
>> I'm a bright... http://www.the-brights.net/
>
>
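
For anyone trying to reproduce this, here is a minimal sketch (in C) of the
kind of pattern Justin describes: a fixed schedule of nonblocking sends and
receives that is re-executed every timestep with many outstanding messages
per rank. This is not Justin's code; the ring neighbors, message count,
message size, and tags are made up for illustration, and whether it actually
triggers the freelist deadlock on a pre-fix 1.3 snapshot will depend on the
interconnect and the message volume.

#include <mpi.h>
#include <stdlib.h>

#define NSTEPS  20      /* schedule reused unchanged for ~20 timesteps (per the thread) */
#define NMSGS   64      /* illustrative: many messages per rank per timestep */
#define MSG_LEN 4096    /* illustrative message length, in doubles */

int main(int argc, char **argv)
{
    int rank, size, left, right, step, m;
    double *sendbuf, *recvbuf;
    MPI_Request reqs[2 * NMSGS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = calloc((size_t)NMSGS * MSG_LEN, sizeof(double));
    recvbuf = calloc((size_t)NMSGS * MSG_LEN, sizeof(double));
    right = (rank + 1) % size;
    left  = (rank + size - 1) % size;

    for (step = 0; step < NSTEPS; step++) {
        /* Post the entire (unchanged) schedule with nonblocking calls... */
        for (m = 0; m < NMSGS; m++) {
            MPI_Irecv(recvbuf + m * MSG_LEN, MSG_LEN, MPI_DOUBLE,
                      left, m, MPI_COMM_WORLD, &reqs[m]);
            MPI_Isend(sendbuf + m * MSG_LEN, MSG_LEN, MPI_DOUBLE,
                      right, m, MPI_COMM_WORLD, &reqs[NMSGS + m]);
        }
        /* ...then complete all of it before the next timestep. */
        MPI_Waitall(2 * NMSGS, reqs, MPI_STATUSES_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Build it with mpicc and run with something like mpirun -np 16; on 1.3rc3 or
later (with the ticket #1725 fix) it is expected to complete every time.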