Open MPI Development Mailing List Archives

Subject: [OMPI devel] RFC: Update libevent
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-05-01 10:38:54

WHAT: Update libevent to 2.0.19 release

WHEN: As soon as it is released, expected around May 11

WHY: The 2.0.19 release contains a critical fix to a bug I recently discovered in the libevent 2.0.x series

I discovered a bug in libevent over the last few days that causes it to unexpectedly "invert" event priorities. It is a slightly subtle bug, but we were able to provide a simple reproducer and so the libevent folks were able to quickly implement a fix.

Stated simply, if you were in an event of a given priority and activated an event of higher priority, that new event would not get serviced if any event of the current priority were to become active prior to leaving the current event. In other words, libevent would service all active events of the current priority before even looking to see if a higher priority event was active.

The patch adds the following logic to event_active:

> IF <I am in an event> AND
> IF <ev->base> EQ <current-base> AND
> IF <pri> LT <current-pri> THEN
> <rescan queues on next loop>

Thus, a rescan only occurs if a higher priority event becomes active during an event of lower priority (libevent treats numerically lower values as higher priority, hence the LT test). Unfortunately, ORTE relies on higher priority events being serviced ahead of lower priority ones in order to handle errors. Without the fix, an error reported in a message from a daemon (for example) cannot be serviced until ALL messages that arrive during the processing of that message have been handled. On a large cluster that is receiving a long stream of messages, this delays error handling for quite some time.
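
To make the difference concrete, here is a minimal, self-contained sketch of the dispatch behavior described above. It is a toy model and not libevent's code: the `sim_*` names, the two-level priority table, and the event ids are all hypothetical, and the `ev->base` check from the patch is omitted because the model has only one base. Three "message" events are queued at priority 1, and the callback for the first one activates an "error" event at priority 0.

```c
#include <assert.h>
#include <stddef.h>

#define NPRI 2    /* priority levels; 0 is the highest, as in libevent */
#define QCAP 16   /* capacity of each toy queue */

enum { EV_ERROR = 100 };   /* hypothetical id for the error event */

struct sim {
    int    queue[NPRI][QCAP];     /* one FIFO of active event ids per priority */
    size_t head[NPRI], tail[NPRI];
    int    current_pri;           /* priority being serviced, -1 if none */
    int    rescan;                /* set when the queues must be rescanned */
    int    patched;               /* nonzero: apply the logic from the patch */
};

/* Toy stand-in for event_active(): queue the event and, if patched,
 * request a rescan when it outranks the event currently running. */
static void sim_activate(struct sim *s, int id, int pri)
{
    assert(s->tail[pri] < QCAP);
    s->queue[pri][s->tail[pri]++] = id;
    if (s->patched && s->current_pri >= 0 && pri < s->current_pri)
        s->rescan = 1;
}

/* Callback for every event: message 1 reports an error. */
static void sim_callback(struct sim *s, int id)
{
    if (id == 1)
        sim_activate(s, EV_ERROR, 0);
}

/* Run the toy loop and record the order in which events are serviced.
 * Returns the number of ids written to order[]. */
size_t sim_run(int patched, int *order, size_t max)
{
    struct sim s = {0};
    size_t n = 0;

    s.patched = patched;
    s.current_pri = -1;
    sim_activate(&s, 1, 1);   /* three messages at priority 1 */
    sim_activate(&s, 2, 1);
    sim_activate(&s, 3, 1);

    for (;;) {
        int pri = 0;

        /* Scan for the highest priority queue with active events. */
        while (pri < NPRI && s.head[pri] == s.tail[pri])
            pri++;
        if (pri == NPRI)
            break;            /* nothing left to service */

        s.rescan = 0;
        s.current_pri = pri;
        /* Drain this queue; only the patched variant stops early. */
        while (s.head[pri] != s.tail[pri] && !s.rescan) {
            int id = s.queue[pri][s.head[pri]++];
            if (n < max)
                order[n++] = id;
            sim_callback(&s, id);
        }
        s.current_pri = -1;
    }
    return n;
}
```

Calling `sim_run(0, order, 8)` models the unpatched behavior and services the events as 1, 2, 3, 100: the error waits behind every queued message. Calling `sim_run(1, order, 8)` models the patched behavior and services them as 1, 100, 2, 3: the error is handled as soon as the callback that raised it returns.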