Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-12-13 11:28:24


Ralph and I chatted on the phone about this. Let's clarify a few things here for the user list:

1. It looks like we don't have this issue explicitly discussed on the FAQ. We obliquely discuss it in:

http://www.open-mpi.org/faq/?category=all#oversubscribing
and
http://www.open-mpi.org/faq/?category=all#force-aggressive-degraded

I'll try to fix that this week.

2. Ralph's initial description is still correct. OMPI calls sched_yield() in the middle of its progress loop when you enable the yield_when_idle behavior. This will *not* cause a (significant) reduction of CPU utilization because OMPI is still busy-polling. But it will yield periodically so that other processes *can* run if the OS allows them to. Due to OS bookkeeping, this yielding behavior may result in a slight reduction of top/ps-reported CPU utilization. But it's likely not significant.
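To make the mechanics concrete, here is a minimal sketch of what such a busy-poll-then-yield loop looks like. This is illustrative C, not OMPI's actual internals -- the function and variable names are made up (OMPI's real progress engine, opal_progress, is considerably more involved):

```c
#include <sched.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
static bool yield_when_idle = true;  /* e.g., enabled via --mca mpi_yield_when_idle 1 */

/* Stand-in for polling the network / shared-memory queues for events. */
static bool poll_for_events(void)
{
    return false;  /* pretend nothing arrived this pass */
}

void progress_once(void)
{
    bool did_work = poll_for_events();
    if (!did_work && yield_when_idle) {
        /* Give the rest of our timeslice back to the OS.  We return
         * immediately to polling, so top/ps still shows ~100% CPU unless
         * another runnable process actually takes over the core. */
        sched_yield();
    }
}
```

The caller spins on progress_once() until the message arrives, which is why yielding alone doesn't reduce reported CPU utilization.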

3. I was trying to point out that the exact behavior of sched_yield() (which OMPI uses to yield in Linux) has changed in the Linux kernel over time. There's an interesting discussion on the MPICH mailing list archives back in 2007 about what exactly this means to MPI process performance -- read this thread all the way through (the Linux sched_yield() discussion is near the end):

    https://lists.mcs.anl.gov/mailman/htdig/mpich-discuss/2007-September/002711.html
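For contrast, the usleep(1) idea that comes up later in this thread (and was shot down) would actually block rather than yield, which does drop reported CPU usage -- at the cost of message latency. A hedged sketch, again with made-up names:

```c
#include <unistd.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
static bool poll_for_events_v2(void)
{
    return false;  /* pretend nothing arrived this pass */
}

void progress_once_sleepy(void)
{
    if (!poll_for_events_v2()) {
        /* Unlike sched_yield(), usleep() really blocks, so the kernel can
         * account the process as idle.  The cost: usleep(1) typically
         * rounds up to at least a scheduler tick, adding latency to every
         * message that arrives while we sleep. */
        usleep(1);
    }
}
```

That latency penalty is the usual reason MPI implementations prefer busy-polling plus yield over sleeping.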

On Dec 13, 2010, at 10:52 AM, Jeff Squyres wrote:

> I think there *was* a decision: it changed how sched_yield() effectively operates, and it may no longer do what we expect.
>
> See this thread (the discussion of Linux/sched_yield() comes in the later messages):
>
> http://www.open-mpi.org/community/lists/users/2010/07/13729.php
>
> I believe there are similar threads in the MPICH mailing list archives; that's why Dave posted on the OMPI list about it.
>
> We briefly discussed replacing OMPI's sched_yield() with a usleep(1), but it was shot down.
>
>
> On Dec 13, 2010, at 10:47 AM, Ralph Castain wrote:
>
>> Thanks for the link!
>>
>> Just to clarify for the list, my original statement is essentially correct. When calling sched_yield, we give up the remaining portion of our time slice.
>>
>> The issue in the kernel world centers around where to put you in the scheduling cycle once you have called sched_yield. Do you go to the end of the schedule for your priority? Do you go to the end of the schedule for all priorities? Or...where?
>>
>> Looks like they decided not to decide, and left several options available. It's not entirely clear what the default is, and they recommend we not use sched_yield but instead release the time slice some other way. We'll take this up on the developer list to see what (if anything) we want to do about it.
>>
>> Bottom line for users: the results remain the same. If no other process wants time, you'll continue to see near 100% utilization even if we yield because we will always poll for some time before deciding to yield.
>>
>>
>> On Dec 13, 2010, at 7:52 AM, Jeff Squyres wrote:
>>
>>> See the discussion on kerneltrap:
>>>
>>> http://kerneltrap.org/Linux/CFS_and_sched_yield
>>>
>>> Looks like the change came in somewhere around 2.6.23 or so...?
>>>
>>>
>>>
>>> On Dec 13, 2010, at 9:38 AM, Ralph Castain wrote:
>>>
>>>> Could you at least provide a one-line explanation of that statement?
>>>>
>>>>
>>>> On Dec 13, 2010, at 7:31 AM, Jeff Squyres wrote:
>>>>
>>>>> Also note that recent versions of the Linux kernel have changed what sched_yield() does -- it no longer does essentially what Ralph describes below. Google around to find those discussions.
>>>>>
>>>>>
>>>>> On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:
>>>>>
>>>>>> Sorry for delay - am occupied with my day job.
>>>>>>
>>>>>> Yes, that is correct to an extent. When you yield the processor, all that happens is that you surrender the rest of your scheduled time slice back to the OS. The OS then cycles thru its scheduler and sequentially assigns the processor to the line of waiting processes. Eventually, the OS will cycle back to your process, and you'll begin cranking again.
>>>>>>
>>>>>> So if no other process wants or needs attention, then yes - it will cycle back around to you pretty quickly. In cases where only system processes are running (besides my MPI ones, of course), I'll typically see cpu usage drop a few percentage points - down to around 95% - because most system tools are very courteous and call yield if they don't need to do something. If there is something out there that wants time, or is less courteous, then my cpu usage can change a great deal.
>>>>>>
>>>>>> Note, though, that top and ps are -very- coarse measuring tools. You'll probably see them reading more like 100% simply because, averaged out over their sampling periods, nobody else is using enough to measure the difference.
>>>>>>
>>>>>>
>>>>>> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>>>>>>>> Behalf Of Eugene Loh
>>>>>>>> Sent: 08 December 2010 16:19
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>>>>>>> 100% cpu
>>>>>>>>
>>>>>>>> I wouldn't mind some clarification here. Would CPU usage really
>>>>>>>> decrease, or would other processes simply have an easier time getting
>>>>>>>> cycles? My impression of yield was that if there were no one to yield
>>>>>>>> to, the "yielding" process would still go hard. Conversely, turning on
>>>>>>>> "yield" would still show 100% cpu, but it would be easier for other
>>>>>>>> processes to get time.
>>>>>>>>
>>>>>>> Any clarifications?
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>
>>>
>>
>>
>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/