Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: libevent update
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-03-18 17:19:49


It's like rewriting libevent from scratch. I guess it can be done, but
it will be a long and painful process. How about the following solution:

- the daemons know when checkpointing is enabled. They can set the
environment variable that forces opal_event_include to be set to
select.

- as environment variables have a higher priority than the
configuration file, this will work in most cases (except when the user
adds the MCA parameter by hand).

- in the checkpoint/restart code, we can add a test that checks the
value of opal_event_include, prints a message if the value is not
select, and disables the checkpoint/restart functionality.

   george.

On Mar 18, 2008, at 4:59 PM, Jeff Squyres wrote:

> George added an MCA parameter for it (opal_event_include is a string
> that can be set to "select" or "poll"), but it has to be set before
> opal_init().
>
> Josh: could you try running with the MCA parameter opal_event_include
> set to "select"? This would confirm Brian's hypothesis...
>
> Given that opal_init() is the first thing that happens in
> ompi_mpi_init(), this may not be enough -- you could *detect* that we
> can't do BLCR, but this mechanism doesn't allow libmpi to set
> something saying "reset libevent to be able to only use select()."
>
> George -- is that hard to add? I would imagine that it could be kinda
> difficult to reset libevent after there are already users of it, fd's
> and other events that may have been added, etc...?
>
>
> On Mar 18, 2008, at 4:29 PM, Brian W. Barrett wrote:
>
>> Jeff / George -
>>
>> Did you add a way to specify which event modules are used? Because epoll
>> pushes the socket list into the kernel, I can see how it would screw up
>> BLCR. I bet everything would work if we forced the use of poll / select.
>>
>> Brian
>>
>> On Tue, 18 Mar 2008, Jeff Squyres wrote:
>>
>>> Crud, ok. Keep us posted.
>>>
>>> On Mar 18, 2008, at 4:16 PM, Josh Hursey wrote:
>>>
>>>> I'm testing with checkpoint/restart and the new libevent seems to be
>>>> messing up the checkpoints generated by BLCR. I'll be taking a look
>>>> at it over the next couple of days, but just thought I'd let people
>>>> know. Unfortunately I don't have any more details at the moment.
>>>>
>>>> -- Josh
>>>>
>>>> On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:
>>>>
>>>>> WHAT: Bring new version of libevent to the trunk.
>>>>>
>>>>> WHY: Newer version, slightly better performance (lower overheads /
>>>>> lighter weight), properly integrate the use of epoll and other
>>>>> scalable fd monitoring mechanisms.
>>>>>
>>>>> WHERE: 98% of the changes are in opal/event; there's a few changes to
>>>>> configury and one change to the orted.
>>>>>
>>>>> TIMEOUT: COB, Friday, 21 March 2008
>>>>>
>>>>> DESCRIPTION:
>>>>>
>>>>> George/UTK has done the bulk of the work to integrate a new version of
>>>>> libevent on the following tmp branch:
>>>>>
>>>>> https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge
>>>>>
>>>>> ** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **
>>>>>
>>>>> Cisco ran MTT on this branch on Friday and everything checked out
>>>>> (i.e., no more failures than on the trunk). We just made a few more
>>>>> minor changes today and I'm running MTT again now, but I'm not
>>>>> expecting any new failures (MTT will take several hours). We would
>>>>> like to bring the new libevent in over this upcoming weekend, but
>>>>> would very much appreciate if others could test on their platforms
>>>>> (Cisco tests mainly 64 bit RHEL4U4). This new libevent *should* be a
>>>>> fairly side-effect free change, but it is possible that since we're
>>>>> now using epoll and other scalable fd monitoring tools, we'll run into
>>>>> some unanticipated issues on some platforms.
>>>>>
>>>>> Here's a consolidated diff if you want to see the changes:
>>>>>
>>>>> https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>>
>
>
> --
> Jeff Squyres
> Cisco Systems


