Open MPI User's Mailing List Archives

From: Heywood, Todd (heywood_at_[hidden])
Date: 2007-03-27 09:15:24


I tried the trunk version with "--mca btl tcp,self". Essentially system time
changes to idle time, since empty polling is being replaced by blocking
(right?). Page faults go to 0 though.
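
To make the polling-versus-blocking distinction concrete, here is a minimal sketch in Python. It is not Open MPI's progress engine; a local socket pair stands in for the TCP connection, and the names cpu_seconds and wait_for_message are made up for this illustration. It waits for one delayed message, either by spinning on zero-timeout select() calls (empty polls) or by sleeping in a single blocking select(), and reports the CPU time used while waiting.

```python
import resource
import select
import socket
import threading


def cpu_seconds():
    """Return (user, system) CPU seconds consumed by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime, usage.ru_stime


def wait_for_message(mode, delay=0.3):
    """Wait for one message that is sent after `delay` seconds.

    mode="poll":  spin on zero-timeout select() calls (empty polls).
    mode="block": sleep inside one select() call until data arrives.
    Returns the (user, system) CPU seconds spent while waiting.
    """
    reader, writer = socket.socketpair()
    # Hypothetical peer: a timer thread that sends one byte after `delay`.
    sender = threading.Timer(delay, writer.send, args=(b"x",))
    user0, sys0 = cpu_seconds()
    sender.start()
    if mode == "poll":
        while not select.select([reader], [], [], 0)[0]:
            pass  # nothing ready yet: each empty poll is a system call
    else:
        select.select([reader], [], [])  # the kernel wakes us when data arrives
    user1, sys1 = cpu_seconds()
    reader.recv(1)
    sender.join()
    reader.close()
    writer.close()
    return user1 - user0, sys1 - sys0
```

Calling wait_for_message("poll") and wait_for_message("block") back to back should show the polling variant burning CPU for the whole wait while the blocking variant stays close to idle. The exact split between user and system time depends on the platform.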

It is interesting since you can see what is going on now, with distinct
phases of user time and idle time (sleep mode, en masse). Before, vmstat
showed processes going into sleep mode rather randomly, and distinct phases
of mostly user time or mostly system time were not visible.
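
One way to pick these phases out of a saved vmstat log is sketched below in Python. It assumes the log contains a header line naming the CPU columns "us", "sy" and "id" followed by rows of integers, as Linux vmstat prints them; the function names cpu_split and dominant_phase are hypothetical.

```python
def cpu_split(vmstat_text):
    """Extract (user, system, idle) percentages from vmstat-style text.

    Columns are located by name from the header line, so the result does
    not depend on how many other columns this vmstat version prints.
    """
    columns = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if {"us", "sy", "id"} <= set(fields):
            # Header line: remember where the CPU columns are.
            columns = [fields.index(name) for name in ("us", "sy", "id")]
        elif (columns and len(fields) > max(columns)
              and all(f.lstrip("-").isdigit() for f in fields)):
            # Data row: keep only the three CPU percentages.
            samples.append(tuple(int(fields[i]) for i in columns))
    return samples


def dominant_phase(sample):
    """Name the largest of the (user, system, idle) percentages."""
    user, system, idle = sample
    return max((user, "user"), (system, "system"), (idle, "idle"))[1]
```

Feeding it the text captured from something like "vmstat 1" and mapping dominant_phase over the samples gives a per-second sequence of user, system and idle labels, which makes the alternating phases easy to see.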

I also tried mpi_yield_when_idle=0 with the trunk version. No effect on
behavior.
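
For reference, the two settings tried in this thread were passed in two different ways: "--mca btl tcp,self" on the mpirun command line, and OMPI_MCA_mpi_yield_when_idle=0 in the environment. The Python sketch below only builds those two forms from a dictionary; the helper names mca_args and mca_env are invented for this example, and nothing here launches a job.

```python
def mca_args(params):
    """Return MCA parameters as mpirun command-line arguments."""
    args = []
    for name, value in params.items():
        args += ["--mca", name, str(value)]
    return args


def mca_env(params):
    """Return MCA parameters as OMPI_MCA_* environment variables."""
    return {"OMPI_MCA_" + name: str(value) for name, value in params.items()}


# The two parameters discussed in this thread.
settings = {"btl": "tcp,self", "mpi_yield_when_idle": 0}
command = ["mpirun"] + mca_args(settings)  # program name and -np still to be added
environment = mca_env(settings)
```

Either form should select the same parameter values; which one is more convenient depends on whether the job is started by hand or through a scheduler script.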

Todd

On 3/23/07 7:15 PM, "George Bosilca" <bosilca_at_[hidden]> wrote:

> So far the described behavior seems normal and as expected. As Open
> MPI never goes into blocking mode, the processes will always spin
> between active and sleep mode. More processes on the same node lead
> to more time in system mode (because of the empty polls). There
> is a trick in the trunk version of Open MPI which will trigger
> blocking mode if and only if TCP is the only device in use. Please try
> adding "--mca btl tcp,self" to your mpirun command line, and check the
> output of vmstat.
>
> Thanks,
> george.
>
> On Mar 23, 2007, at 3:32 PM, Heywood, Todd wrote:
>
>> Rolf,
>>
>>> Is it possible that everything is working just as it should?
>>
>> That's what I'm afraid of :-). But I did not expect to see such
>> communication overhead due to blocking from mpiBLAST, which is very
>> coarse-grained. I then tried HPL, which is computation-heavy, and found
>> the same thing. Also, the system time seemed to correspond to the MPI
>> processes cycling between run and sleep (as seen via top), and I thought
>> that setting the mpi_yield_when_idle parameter to 0 would keep the
>> processes from entering sleep state when blocking. But it doesn't.
>>
>> Todd
>>
>>
>>
>> On 3/23/07 2:06 PM, "Rolf Vandevaart" <Rolf.Vandevaart_at_[hidden]> wrote:
>>
>>>
>>> Todd:
>>>
>>> I assume the system time is being consumed by
>>> the calls to send and receive data over the TCP sockets.
>>> As the number of processes in the job increases, then more
>>> time is spent waiting for data from one of the other processes.
>>>
>>> I did a little experiment on a single node to see the difference
>>> in system time consumed when running over TCP vs when
>>> running over shared memory. When running on a single
>>> node and using the sm btl, I see almost 100% user time.
>>> I assume this is because the sm btl handles sending and
>>> receiving its data within a shared memory segment.
>>> However, when I switch over to TCP, I see my system time
>>> go up. Note that this is on Solaris.
>>>
>>> RUNNING OVER SELF,SM
>>> > mpirun -np 8 -mca btl self,sm hpcc.amd64
>>>
>>>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>>>  3505 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
>>>  3503 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
>>>  3499 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
>>>  3497 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
>>>  3501 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
>>>  3507 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
>>>  3509 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
>>>  3495 rolfv     97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1
>>>
>>> RUNNING OVER SELF,TCP
>>> > mpirun -np 8 -mca btl self,tcp hpcc.amd64
>>>
>>>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>>>  4316 rolfv     93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
>>>  4328 rolfv     91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
>>>  4324 rolfv     98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
>>>  4320 rolfv     88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
>>>  4322 rolfv     94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
>>>  4318 rolfv     92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
>>>  4326 rolfv     93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
>>>  4314 rolfv     91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1
>>>
>>> I also ran HPL over a larger cluster of 6 nodes, and noticed even
>>> higher system times.
>>>
>>> And lastly, I ran a simple MPI test over a cluster of 64 nodes, 2
>>> procs per node, using Sun HPC ClusterTools 6, and saw about a 50/50
>>> split between user and system time.
>>>
>>>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>>> 11525 rolfv     55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0 maxtrunc_ct6/1
>>> 11526 rolfv     54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0 maxtrunc_ct6/1
>>>
>>> Is it possible that everything is working just as it should?
>>>
>>> Rolf
>>>
>>> Heywood, Todd wrote On 03/22/07 13:30,:
>>>
>>>> Ralph,
>>>>
>>>> Well, according to the FAQ, aggressive mode can be "forced" so I did try
>>>> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
>>>> processor/memory affinity on. Effects were minor. The MPI tasks still cycle
>>>> between run and sleep states, driving up system time well over user time.
>>>>
>>>> Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
>>>> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
>>>> sure, I also tried running directly with a hostfile with slots=4 or slots=2.
>>>> The same behavior occurs.
>>>>
>>>> This behavior is a function of the size of the job. I.e., as I scale from 200
>>>> to 800 tasks the run/sleep cycling increases, so that system time grows from
>>>> maybe half the user time to maybe 5 times user time.
>>>>
>>>> This is for TCP/gigE.
>>>>
>>>> Todd
>>>>
>>>>
>>>> On 3/22/07 12:19 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>
>>>>
>>>>
>>>>> Just for clarification: ompi_info only shows the *default* value of the MCA
>>>>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>>>>> that value is reset internally if the system sees an "oversubscribed"
>>>>> condition.
>>>>>
>>>>> The issue here isn't how many cores are on the node, but rather how many
>>>>> were specifically allocated to this job. If the allocation wasn't at least 2
>>>>> (in your example), then we would automatically reset mpi_yield_when_idle to
>>>>> be non-aggressive, regardless of how many cores are actually on the node.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On 3/22/07 7:14 AM, "Heywood, Todd" <heywood_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>>>>>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>>>>>> higher system time than user time.
>>>>>>
>>>>>> ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
>>>>>> so that suggests the tasks aren't swapping out on blocking calls.
>>>>>>
>>>>>> Still puzzled.
>>>>>>
>>>>>> Thanks,
>>>>>> Todd
>>>>>>
>>>>>>
>>>>>> On 3/22/07 7:36 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Are you using a scheduler on your system?
>>>>>>>
>>>>>>> More specifically, does Open MPI know that you have four process slots
>>>>>>> on each node? If you are using a hostfile and didn't specify
>>>>>>> "slots=4" for each host, Open MPI will think that it's
>>>>>>> oversubscribing and will therefore call sched_yield() in the depths
>>>>>>> of its progress engine.
>>>>>>>
>>>>>>>
>>>>>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> P.S. I should have said that this is a pretty coarse-grained application,
>>>>>>>> and netstat doesn't show much communication going on (except in stages).
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/21/07 4:21 PM, "Heywood, Todd" <heywood_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> I noticed that my Open MPI processes are using larger amounts of system
>>>>>>>>> time than user time (via vmstat, top). I'm running on dual-core, dual-CPU
>>>>>>>>> Opterons, with 4 slots per node, where the program has the nodes to
>>>>>>>>> itself. A closer look showed that the processes are constantly switching
>>>>>>>>> between run and sleep states with 4-8 page faults per second.
>>>>>>>>>
>>>>>>>>> Why would this be? It doesn't happen with 4 sequential jobs running on a
>>>>>>>>> node, where I get 99% user time, maybe 1% system time.
>>>>>>>>>
>>>>>>>>> The processes have plenty of memory. This behavior occurs whether I use
>>>>>>>>> processor/memory affinity or not (there is no oversubscription).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Todd
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users