Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time
From: Yong Qin (yong.qin_at_[hidden])
Date: 2012-09-05 12:07:35


Yes, so far this has only been observed in VASP and a specific dataset.

Thanks,

On Wed, Sep 5, 2012 at 4:52 AM, Yevgeny Kliteynik
<kliteyn_at_[hidden]> wrote:
> On 9/4/2012 7:21 PM, Yong Qin wrote:
>> On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik
>> <kliteyn_at_[hidden]> wrote:
>>> On 8/30/2012 10:28 PM, Yong Qin wrote:
>>>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres<jsquyres_at_[hidden]> wrote:
>>>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>>>>
>>>>>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but
>>>>>> not on 1.4.5 (tcp btl is always fine). The application is VASP and
>>>>>> only one specific dataset is identified during the testing, and the OS
>>>>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>>>>> when a certain type of load is put on OMPI 1.6.x, the khugepaged thread
>>>>>> always runs at 100% CPU load, and it looks to me like OMPI is
>>>>>> waiting for some memory to become available and thus appears to be hung.
>>>>>> Reducing the per node processes would sometimes ease the problem a bit
>>>>>> but not always. So I did some further testing by playing around with
>>>>>> the kernel transparent hugepage support.
>>>>>>
>>>>>> 1. Disable transparent hugepage support completely (echo never >
>>>>>> /sys/kernel/mm/redhat_transparent_hugepage/enabled). This would allow
>>>>>> the program to progress as normal (as in 1.4.5). Total run time for an
>>>>>> iteration is 3036.03 s.
>>>>>
>>>>> I'll admit that we have not tested using transparent hugepages. I wonder if there's some kind of bad interaction going on here...
>>>>
>>>> The transparent hugepage is "transparent", which means it is
>>>> automatically applied to all applications unless it is explicitly told
>>>> otherwise. I highly suspect that it is not working properly in this
>>>> case.
>>>
>>> Like Jeff said - I don't think we've ever tested OMPI with transparent
>>> huge pages.
>>>
>>
>> Thanks. But have you tested OMPI under RHEL 6 or its variants (CentOS
>> 6, SL 6)? THP is on by default in RHEL 6, so whether you want it or
>> not, it's there.
>
> Interesting. Indeed, THP is on by default in RHEL 6.x.
> I run OMPI 1.6.x constantly on RHEL 6.2, and I've never seen this problem.
>
> I'm checking it with the OFED folks, but I doubt that there are any dedicated
> tests for THP.
>
> So do you see it only with a specific application and only on a specific
> data set? Wonder if I can somehow reproduce it in-house...
>
> -- YK
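
For anyone trying to reproduce this, it may help to record which THP mode each node is actually running in before a test. The following is only a minimal sketch: it assumes the sysfs file lists the available modes with the active one in square brackets (for example "[always] never"), and it tries the RHEL 6 path quoted above first, then the upstream kernel path as a fallback. The function names parse_thp_mode and read_thp_mode are made up for this illustration.

```python
import re
from pathlib import Path

# Candidate locations of the THP control file. The first is the RHEL 6
# path quoted in this thread; the second is the upstream kernel path,
# included as an assumption for other distributions.
THP_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def parse_thp_mode(text):
    """Return the mode marked with square brackets, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(paths=THP_PATHS):
    """Return the active THP mode from the first readable path, else None."""
    for path in paths:
        try:
            return parse_thp_mode(Path(path).read_text())
        except OSError:
            # File missing or unreadable on this system; try the next one.
            continue
    return None


if __name__ == "__main__":
    # Prints the active mode, or None if no THP control file was found.
    print(read_thp_mode())
```

Running this on every node before and after the "echo never" step would confirm that the setting really took effect everywhere.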