Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time
From: Yevgeny Kliteynik (kliteyn_at_[hidden])
Date: 2012-09-04 08:42:22


On 8/30/2012 10:28 PM, Yong Qin wrote:
> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres<jsquyres_at_[hidden]> wrote:
>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>
>>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but
>>> not on 1.4.5 (tcp btl is always fine). The application is VASP and
>>> only one specific dataset is identified during the testing, and the OS
>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>> when a certain type of load is put on OMPI 1.6.x, the khugepaged thread
>>> always runs at 100% CPU load, and it looks to me like OMPI is
>>> waiting for some memory to become available and thus appears to be hung.
>>> Reducing the number of processes per node sometimes eases the problem a
>>> bit, but not always. So I did some further testing by playing around with
>>> the kernel transparent hugepage support.
>>>
>>> 1. Disable transparent hugepage support completely (echo never >
>>> /sys/kernel/mm/redhat_transparent_hugepage/enabled). This would allow
>>> the program to progress as normal (as in 1.4.5). Total run time for an
>>> iteration is 3036.03 s.
>>
>> I'll admit that we have not tested using transparent hugepages. I wonder if there's some kind of bad interaction going on here...
>
> Transparent hugepage support is "transparent", which means it is
> automatically applied to all applications unless it is explicitly told
> otherwise. I strongly suspect that it is not working properly in this
> case.

Like Jeff said - I don't think we've ever tested OMPI with transparent
huge pages.

>>
>> What exactly does changing this setting do?
>
> Here (http://lwn.net/Articles/423592/) is a pretty good write-up
> of what these settings do to the behaviour of THP. I don't
> think I can explain it better than the article, so I will leave it to
> you to digest. :)
>
>>
>>> 2. Disable VM defrag effort (echo never >
>>> /sys/kernel/mm/redhat_transparent_hugepage/defrag). This allows the
>>> program to run as well, but the performance is horrible. The same
>>> iteration takes 4967.40 s.
>>>
>>> 3. Disable defrag in khugepaged (echo no >
>>> /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag). This
>>> allows the program to run, and the performance is worse than #1 but
>>> better than #2. The same iteration takes 3348.10 s.
>>>
>>> 4. Disable both VM defrag and khugepaged defrag (#2 + #3). Similar
>>> performance to #3.
>>>
>>> So my question is this: it looks to me like this has to do with the
>>> memory management in the openib btl. Are we using huge pages in 1.6.x?
>>> If so, is there a better way to resolve or work around it within
>>> OMPI itself without disabling transparent hugepage support? We'd like
>>> to keep the hugepage support if possible.
>>
>> Mellanox -- can you comment on this?

Actually, I don't think that THP was ever really tested with OFED.
I can think of lots of ways things can go wrong there.
This might be a good question for the Linux-RDMA mailing list.
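
For anyone repeating the four tests quoted above, it may help to record
which THP settings are actually active before each run. Below is a minimal
sketch, not a tested tool. The redhat_transparent_hugepage path comes from
this thread; the mainline transparent_hugepage path is an assumption for
other kernels; thp_active_value and thp_report are hypothetical helper
names used only for illustration.

```shell
#!/bin/sh
# Sketch: report which transparent hugepage (THP) settings are active.
# Assumption: RHEL/SL 6 kernels expose redhat_transparent_hugepage, while
# mainline kernels expose transparent_hugepage. Both are probed.

# Print the bracketed (active) choice from a sysfs line such as
# "always madvise [never]". Prints nothing if the line has no brackets.
thp_active_value() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "<file>: <active value>" for each THP knob readable on this host.
thp_report() {
    for base in /sys/kernel/mm/redhat_transparent_hugepage \
                /sys/kernel/mm/transparent_hugepage; do
        [ -d "$base" ] || continue
        for knob in enabled defrag khugepaged/defrag; do
            [ -r "$base/$knob" ] || continue
            line=$(cat "$base/$knob")
            value=$(thp_active_value "$line")
            # Some knobs hold a plain value with no brackets; show it raw.
            echo "$base/$knob: ${value:-$line}"
        done
    done
}

thp_report
```

Reading these files does not need root; only the echo commands quoted
above, which change the settings, do.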

-- YK