Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time
From: Yevgeny Kliteynik (kliteyn_at_[hidden])
Date: 2012-09-05 07:52:14

On 9/4/2012 7:21 PM, Yong Qin wrote:
> On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik
> <kliteyn_at_[hidden]> wrote:
>> On 8/30/2012 10:28 PM, Yong Qin wrote:
>>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres<jsquyres_at_[hidden]> wrote:
>>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>>>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but
>>>>> not on 1.4.5 (tcp btl is always fine). The application is VASP and
>>>>> only one specific dataset is identified during the testing, and the OS
>>>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>>>> when a certain type of load is put on OMPI 1.6.x, khugepaged thread
>>>>> always runs with 100% CPU load, and it looks to me like that OMPI is
>>>>> waiting for some memory to be available thus appears to be hung.
>>>>> Reducing the per node processes would sometimes ease the problem a bit
>>>>> but not always. So I did some further testing by playing around with
>>>>> the kernel transparent hugepage support.
>>>>> 1. Disable transparent hugepage support completely (echo never >
>>>>> /sys/kernel/mm/redhat_transparent_hugepage/enabled). This would allow
>>>>> the program to progress as normal (as in 1.4.5). Total run time for an
>>>>> iteration is 3036.03 s.
>>>> I'll admit that we have not tested using transparent hugepages. I wonder if there's some kind of bad interaction going on here...
>>> The transparent hugepage is "transparent", which means it is
>>> automatically applied to all applications unless it is explicitly told
>>> otherwise. I highly suspect that it is not working properly in this
>>> case.
>> Like Jeff said - I don't think we've ever tested OMPI with transparent
>> huge pages.
> Thanks. But have you tested OMPI under RHEL 6 or its variants (CentOS
> 6, SL 6)? THP is on by default in RHEL 6 so no matter you want it or
> not it's there.

Interesting. Indeed, THP is on by default in RHEL 6.x.
I run OMPI 1.6.x constantly on RHEL 6.2, and I've never seen this problem.
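
A minimal sketch for checking which THP mode is active on a node. It assumes
the sysfs file lists the available modes with the active one in square
brackets (for example "[always] never"). The helper names are hypothetical,
and the second path is the upstream kernel location, tried as a fallback to
the RHEL 6 path quoted above.

```python
import re
from pathlib import Path

# Candidate sysfs locations: the RHEL 6 path quoted in this thread first,
# then the upstream kernel path as a fallback (assumption for other distros).
THP_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def parse_thp_mode(text):
    """Return the bracketed (active) mode from the file contents, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(paths=THP_PATHS):
    """Return the active THP mode from the first readable path, or None."""
    for path in paths:
        try:
            return parse_thp_mode(Path(path).read_text())
        except OSError:
            # File missing or unreadable on this system; try the next one.
            continue
    return None


if __name__ == "__main__":
    print(current_thp_mode())
```

Running this on each compute node before a job would show whether THP is
actually enabled there; None means neither file could be read.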

I'm checking with the OFED folks, but I doubt that there are any dedicated
tests for THP.

So do you see it only with a specific application and only on a specific
data set? I wonder if I can somehow reproduce it in-house...

-- YK