
Open MPI User's Mailing List Archives


From: Heywood, Todd (heywood_at_[hidden])
Date: 2007-03-22 14:51:34


Hi,

It is v1.2, default configuration. If it matters: OS is RHEL
(2.6.9-42.0.3.ELsmp) on x86_64.

I have noticed this for 2 apps so far, mpiBLAST and HPL, which are both
coarse-grained.

Thanks,

Todd

On 3/22/07 2:38 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:

>
>
>
> On 3/22/07 11:30 AM, "Heywood, Todd" <heywood_at_[hidden]> wrote:
>
>> Ralph,
>>
>> Well, according to the FAQ, aggressive mode can be "forced" so I did try
>> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
>> processor/memory affinity on. Effects were minor. The MPI tasks still cycle
>> between run and sleep states, driving up system time well over user time.
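[For reference, the forcing described above is a minimal sketch; the process count and the `./app` binary are placeholders:]

```shell
# Disable yield-when-idle (0 = aggressive polling) via the environment,
# overriding Open MPI's automatic downgrade on perceived oversubscription:
export OMPI_MCA_mpi_yield_when_idle=0

# Equivalently, the MCA parameter can be passed on the command line:
#   mpirun --mca mpi_yield_when_idle 0 -np 4 ./app
```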
>
> Yes, that's true - and we do (should) respect any such directive.
>
>>
>> Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
>> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
>> sure, I also tried running directly with a hostfile with slots=4 or slots=2.
>> The same behavior occurs.
>
> Okay - thanks for trying that!
>
>>
>> This behavior is a function of the size of the job. That is, as I scale from 200
>> to 800 tasks, the run/sleep cycling increases, so that system time grows from
>> maybe half the user time to maybe 5 times the user time.
>>
>> This is for TCP/gigE.
>
> What version of Open MPI are you using? This sounds like something we need to
> investigate.
>
> Thanks for the help!
> Ralph
>
>>
>> Todd
>>
>>
>> On 3/22/07 12:19 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>
>>> Just for clarification: ompi_info only shows the *default* value of the MCA
>>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>>> that value is reset internally if the system sees an "oversubscribed"
>>> condition.
>>>
>>> The issue here isn't how many cores are on the node, but rather how many
>>> were specifically allocated to this job. If the allocation wasn't at least 2
>>> (in your example), then we would automatically reset mpi_yield_when_idle to
>>> be non-aggressive, regardless of how many cores are actually on the node.
>>>
>>> Ralph
>>>
>>>
>>> On 3/22/07 7:14 AM, "Heywood, Todd" <heywood_at_[hidden]> wrote:
>>>
>>>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>>>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>>>> higher system time than user time.
>>>>
>>>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
>>>> so that suggests the tasks aren't swapping out on blocking calls.
>>>>
>>>> Still puzzled.
>>>>
>>>> Thanks,
>>>> Todd
>>>>
>>>>
>>>> On 3/22/07 7:36 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>
>>>>> Are you using a scheduler on your system?
>>>>>
>>>>> More specifically, does Open MPI know that you have four process slots
>>>>> on each node? If you are using a hostfile and didn't specify
>>>>> "slots=4" for each host, Open MPI will think that it's
>>>>> oversubscribing and will therefore call sched_yield() in the depths
>>>>> of its progress engine.
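[Jeff's point can be sketched with a hypothetical hostfile; the node names `node01`/`node02` and the `./app` binary are placeholders:]

```shell
# Write a hostfile that tells Open MPI exactly how many slots each node
# provides, so it does not assume oversubscription and start calling
# sched_yield() in its progress engine.
cat > myhosts <<'EOF'
node01 slots=4
node02 slots=4
EOF

# The job would then be launched with something like:
#   mpirun -np 8 --hostfile myhosts ./app
```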
>>>>>
>>>>>
>>>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>>>>
>>>>>> P.s. I should have said that this is a pretty coarse-grained
>>>>>> application,
>>>>>> and netstat doesn't show much communication going on (except in
>>>>>> stages).
>>>>>>
>>>>>>
>>>>>> On 3/21/07 4:21 PM, "Heywood, Todd" <heywood_at_[hidden]> wrote:
>>>>>>
>>>>>>> I noticed that my OpenMPI processes are using larger amounts of
>>>>>>> system time
>>>>>>> than user time (via vmstat, top). I'm running on dual-core, dual-CPU
>>>>>>> Opterons, with 4 slots per node, where the program has the nodes to
>>>>>>> themselves. A closer look showed that they are constantly
>>>>>>> switching between
>>>>>>> run and sleep states with 4-8 page faults per second.
>>>>>>>
>>>>>>> Why would this be? It doesn't happen with 4 sequential jobs
>>>>>>> running on a
>>>>>>> node, where I get 99% user time, maybe 1% system time.
>>>>>>>
>>>>>>> The processes have plenty of memory. This behavior occurs whether
>>>>>>> I use
>>>>>>> processor/memory affinity or not (there is no oversubscription).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Todd
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>
>>>>
>>>
>>>
>>