Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Change in OPAL / OMPI DPM system time during MPI_INIT
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2010-11-22 14:47:28


Short answer: we need the "extra" decrement at the end of MPI_INIT.

Long answer: Ok, so I was somewhat wrong :).

The count of event users is initialized to 0. If it's greater than zero, the event library is polled every time opal_progress() is called, which kills latency (I'm surprised this didn't show up in testing). It's really quite pointless for a runtime library or portability library not to poll the event library every time (particularly since the primary communication mechanisms in the runtime library use the event library), so opal_init() increases the counter to 1.

So by the time anything interesting in MPI_INIT happens, the counter is set to 1, and every call to opal_progress() results in a call to the event library. The decrement in MPI_INIT was to "undo" the initialization increment, so that things would run fast from the end of MPI_INIT to the start of MPI_FINALIZE unless some other piece of OMPI knew it needed fast run-time interactions (such as the DPM or the TCP-based BTLs). Of course, during MPI_FINALIZE, we need to "undo" the go-fast settings we made at the end of MPI_INIT, which is why there's an increment early in finalize.
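
To make the counter lifecycle concrete, here is a toy model in C. It is only a sketch of the behavior described above, not the actual OPAL code: every toy_* name is made up for illustration, and the real opal_progress() does more than this. In particular, this email doesn't say what happens when the count is zero, so the toy simply skips the event library in that case.

```c
#include <assert.h>

/* Toy model of the event-user counter; NOT Open MPI source code.
 * All toy_* names are hypothetical stand-ins. */

static int toy_event_users = 0;     /* count of event users, starts at 0 */
static int toy_event_lib_polls = 0; /* times the toy "event library" was polled */

/* Stand-ins for the _increment() / _decrement() calls discussed in the thread. */
static void toy_users_increment(void) { toy_event_users++; }

static void toy_users_decrement(void)
{
    assert(toy_event_users > 0);    /* an unmatched decrement would be a bug */
    toy_event_users--;
}

/* Stand-in for opal_progress(): while anyone is registered as an event
 * user, every call polls the event library.  Assumption: with no users
 * the toy never polls; the email does not describe that case. */
static void toy_progress(void)
{
    if (toy_event_users > 0) {
        toy_event_lib_polls++;
    }
}

/* Call toy_progress() `calls` times; return how many polls that caused. */
static int toy_polls_during(int calls)
{
    int before = toy_event_lib_polls;
    for (int i = 0; i < calls; i++) {
        toy_progress();
    }
    return toy_event_lib_polls - before;
}

/* The three lifecycle points from the explanation above. */
static void toy_opal_init(void)             { toy_users_increment(); } /* 0 -> 1 */
static void toy_end_of_mpi_init(void)       { toy_users_decrement(); } /* 1 -> 0 */
static void toy_start_of_mpi_finalize(void) { toy_users_increment(); } /* 0 -> 1 */
```

Calling toy_opal_init(), then toy_end_of_mpi_init(), then toy_start_of_mpi_finalize(), with toy_polls_during() in between, shows the pattern: every progress call polls during init and finalize, and none do in the middle unless some other component (the DPM, say) does its own matched increment and decrement. Without the decrement at the end of init, the count would stay at 1 for the whole run.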

Brian

On Nov 22, 2010, at 12:27 PM, Jeff Squyres wrote:

> On Nov 22, 2010, at 11:35 AM, Barrett, Brian W wrote:
>
>> Um, the counter starts initialized at one.
>
> Does that mean that we should or should not leave that extra _decrement() in there?
>
>> Brian
>>
>> On Nov 22, 2010, at 9:32 AM, Jeff Squyres wrote:
>>
>>> A user noticed a specific change that we made between 1.4.2 and 1.4.3:
>>>
>>> https://svn.open-mpi.org/trac/ompi/changeset/23448
>>>
>>> which is from CMR https://svn.open-mpi.org/trac/ompi/ticket/2489, and originally from trunk https://svn.open-mpi.org/trac/ompi/changeset/23434. I removed the opal_progress_event_users_decrement() from ompi_mpi_init() because the ORTE DPM does its own _increment() and _decrement().
>>>
>>> However, it seems that there was an unintended consequence of this -- look at the annotated Ganglia graph that the user sent (see attached). In 1.4.2, all of the idle time was "user" CPU usage. In 1.4.3, it's split between user and system CPU usage. The application that he used to test is basically an init / finalize test (with some additional MPI middleware). See:
>>>
>>> http://www.open-mpi.org/community/lists/users/2010/11/14773.php
>>>
>>> Can anyone think of why this occurs, and/or if it's a Bad Thing?
>>>
>>> If removing this decrement enabled a bunch more system CPU time, that would seem to imply that we're calling libevent more frequently than we used to (vs. polling the opal event callbacks), and therefore that there might now be an unmatched increment somewhere.
>>>
>>> Right...?
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> <openmpi143.jpeg><ATT00002..txt>
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories