Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC 2/2: merge the OPAL SOS development branch into trunk
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-05-18 09:20:26


Abhishek and Jeff,

Awesome! Thanks for all your hard work maintaining and shepherding
this branch into the trunk.

-- Josh

On May 17, 2010, at 9:20 PM, Abhishek Kulkarni wrote:

>
> On May 14, 2010, at 12:24 PM, Josh Hursey wrote:
>
>>
>> On May 12, 2010, at 1:07 PM, Abhishek Kulkarni wrote:
>>
>>> Updated RFC (w/ discussed changes):
>>>
>>> ======================================================================
>>> [RFC 2/2] merge the OPAL SOS development branch into trunk
>>> ======================================================================
>>>
>>> WHAT: Merge the OPAL SOS development branch into the OMPI trunk.
>>>
>>> WHY: Bring over some of the work done to enhance error reporting
>>> capabilities.
>>>
>>> WHERE: opal/util/ and a few changes in the ORTE notifier.
>>>
>>> TIMEOUT: May 17, Monday, COB.
>>>
>>> REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/
>>>
>>> ======================================================================
>>>
>>> BACKGROUND:
>>>
>>> The OPAL SOS framework tries to meet the following objectives:
>>>
>>> - Reduce the cascading error messages and the amount of code
>>> needed to print an error message.
>>> - Build and aggregate stacks of encountered errors and associate
>>> related individual errors with each other.
>>> - Allow registration of custom callbacks to intercept error events.
>>>
>>> The SOS system provides an interface to log events of varying
>>> severities. These events are associated with an "encoded" error code
>>> which can be used to refer to stacks of SOS events. When logging
>>> events, they can also be transparently relayed to all the activated
>>> notifier components.
>>>
>>> The SOS system is described in detail on this wiki page:
>>>
>>> http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>>> https://svn.open-mpi.org/trac/ompi/attachment/wiki/ErrorMessages/OPAL_SOS.pdf
>>>
>>> CHANGES (since the last RFC):
>>>
>>> * Wrapped ret in all hard-coded error-code checks (OMPI_ERR_* == ret)
>>> with OPAL_SOS_GET_ERR_CODE(ret). There were about 30-40 such checks
>>> each in the OMPI and ORTE layer and about 15 in the OPAL layer.
>>> Since OPAL_SUCCESS is preserved by SOS, also changed calls of
>>> the form (OPAL_SUCCESS != ret) to (OPAL_ERROR == ret).
>>
>> You mean the other way around, right?
>> You changed code that previously looked like (OPAL_ERROR == ret) to
>> (OPAL_SUCCESS != ret) where appropriate.
>>
>
>
> Yes, thanks for the correction! This (and ORTE WDC) is all in trunk
> now -- I've split the changes into smaller patches (see commits
> r23155 - r23164) so that they are easier to sift through.
>
> Abhishek
>
>
>>>
>>> * If the error is an SOS-encoded error, ORTE_ERROR_LOG decodes
>>> the error, prints out the error stack and frees the errors.
>>>
>>> ======================================================================
>>>
>>>
>>> On Mar 29, 2010, at 10:58 AM, Abhishek Kulkarni wrote:
>>>
>>>>
>>>> ======================================================================
>>>> [RFC 2/2]
>>>> ======================================================================
>>>>
>>>> WHAT: Merge the OPAL SOS development branch into the OMPI trunk.
>>>>
>>>> WHY: Bring over some of the work done to enhance error reporting
>>>> capabilities.
>>>>
>>>> WHERE: opal/util/ and a few changes in the ORTE notifier.
>>>>
>>>> TIMEOUT: April 6, Wednesday, COB.
>>>>
>>>> REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/
>>>>
>>>> ======================================================================
>>>>
>>>> BACKGROUND:
>>>>
>>>> The OPAL SOS framework tries to meet the following objectives:
>>>>
>>>> - Reduce the cascading error messages and the amount of code
>>>> needed to print an error message.
>>>> - Build and aggregate stacks of encountered errors and associate
>>>> related individual errors with each other.
>>>> - Allow registration of custom callbacks to intercept error events.
>>>>
>>>> The SOS system provides an interface to log events of varying
>>>> severities. These events are associated with an "encoded" error code
>>>> which can be used to refer to stacks of SOS events. When logging
>>>> events, they can also be transparently relayed to all the activated
>>>> notifier components.
>>>>
>>>> The SOS system is described in detail on this wiki page:
>>>>
>>>> http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>>>>
>>>> Feel free to comment and/or provide suggestions.
>>>>
>>>> ======================================================================
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel