Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] timing + /usr/bin/time
From: Raymond Wan (rwan_at_[hidden])
Date: 2008-11-17 04:34:20

Hi Jeff,

Thank you for your detailed explanation! I see your point and, given
what you said, I wonder if some people report user time (only) in order
to understate the execution time of their algorithms/programs.

It seems the best solution would be to report user, system, and MPI
times, with the real time being the sum of all three. I doubt that will
ever happen since, in reality, we have to add code to our programs to
enable MPI, and sometimes the OS has to do work on our programs' behalf.
I guess there is no clear division between user and system time in this
case, and the total (real) time is the only one that makes sense.
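To make the user-vs-real distinction concrete: it can be illustrated
without MPI at all. The sketch below is only a minimal illustration (the
helper name is mine, and plain sleeping stands in for a rank blocked on
a message); it contrasts wall-clock time with CPU time for a process
that spends most of its run blocked.

```python
import time

def blocked_then_busy(wait_s=0.2, busy_iters=50_000):
    """Sleep (like a process blocked waiting on a message), then do real work."""
    wall_start = time.perf_counter()   # wall-clock ("real") time
    cpu_start = time.process_time()    # user + system CPU time of this process
    time.sleep(wait_s)                 # blocked: accrues wall time, almost no CPU time
    total = 0
    for i in range(busy_iters):        # busy loop: accrues both wall and CPU time
        total += i * i
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

wall, cpu = blocked_then_busy()
# The sleep shows up in wall time but (mostly) not in CPU time, which is
# why reporting only user time understates the real execution time.
```

The blocked phase is exactly the kind of time that user time hides,
which is the case Fabian describes below for ranks waiting on MPI
messages.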

Thank you again for the explanation!


Jeff Squyres wrote:
> FWIW, I *always* report MPI application time in wall-clock seconds
> time. I know that some people (even among the OMPI developers)
> disagree with me, but to me, there's nothing else that you can measure
> that makes sense.
> Case in point: when using the OpenFabrics network stack, very little
> time is spent in the kernel because OpenFabrics networks are designed
> to bypass the OS (e.g., we spin poll in userspace for OpenFabrics
> message passing progress). The same is true for shared memory (it's a
> "network" because we use it to pass messages between MPI processes).
> But what about TCP? When not using a TOE or other similar technology
> (i.e., 99.99% of the time), you are making OS syscalls.
> Hence, running the same program over these three different networks
> can result in hugely different proportions of user vs. system time,
> even though it's the same app and the same algorithm. Granted, some
> of the networks are faster than the others, but the network should
> always be the slowest part of your computation (assuming you have a
> well-coded application). So which numbers should you report?
> In short: the MPI implementation is doing things for you behind the
> scenes. This raises some obvious questions:
> 1. Do you report the MPI execution times or not?
> 1a. If so, how do you account for the differences in network
> progression (and other issues) based on the type of network?
> 1b. If not, how can you separate the MPI time from your application
> time? (user/system does not make this differentiation; you need
> additional tools to separate MPI vs. application time)
> To me, only wall-clock execution time makes sense. The overall
> performance of your application *includes* the time necessary for
> MPI/message passing and everything else running on the machine. One
> of the major points of parallel computing is to make things go
> faster. To measure that, measure the wall-clock time of the
> application in serial and then measure the wall-clock execution time
> in parallel (perhaps for various different np values). Then you can
> (hopefully) see clear, easy-to-understand speedup. To avoid
> OS-induced jitter and negative timing effects, most people typically
> turn off as many OS services as possible on the nodes that they're
> running, both for production and benchmarking codes (I typically leave
> such services enabled on my software development nodes, because
> they're helpful for debugging, etc.).
> Is wall-clock execution time the only / best metric? Certainly not.
> But I strongly prefer it over user/system time -- I just don't think
> that user/system time tells you what most people think it is telling
> you in a parallel+MPI context.
> On Nov 14, 2008, at 4:32 AM, Raymond Wan wrote:
>> Hi Fabian,
>> Thank you for clarifying things and confirming some of the things
>> that I thought. I guess I have a clearer understanding now.
>> Fabian Hänsel wrote:
>>>> Hmmmm, I guess user time does not matter since it is real time that
>>>> we are interested in reducing.
>>> Right. Even if we *could* measure the user time of every MPI worker
>>> process correctly, it would not be what you are interested in: depending
>>> on the algorithm, a significant amount of time could get spent waiting
>>> for MPI messages to arrive -- and that time would not count as user
>>> time, but it also isn't 'wasted', as something important is happening.
>> The reason why I was wondering is that some people in research papers
>> compare their algorithm (system) with another one by measuring user
>> time since it removes some of the effects of what the system does on
>> behalf of the user's process. And some people, I guess, see this as
>> a fairer comparison.
>> On the other hand, I guess I've realized the obvious -- that Open MPI
>> doesn't reduce the efficiency of the algorithm. Even worse,
>> increases in user time are an artifact of Open MPI, so they are
>> somewhat misleading if we are analyzing an algorithm. What MPI should do (if
>> properly used) is to reduce the real time and that's what we should
>> be reporting...even if it includes other things that we did not want
>> previously, like the time spent by the OS in swapping memory, etc.
>> [Papers I've read with graphs that have "time" on the y-axis and
>> "processors" on the x-axis rarely mention what time they are
>> measuring...but it seems obvious now that it must be real time
>> since user time should [???] increase with more processors.....I
>> think...of course, assuming we can total the user time across
>> machines accurately.]
>> Thank you for your message(s)! Think I got it now... :-)
>> Ray
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]