Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC 2/2: merge the OPAL SOS development branch into trunk
From: Abhishek Kulkarni (adkulkar_at_[hidden])
Date: 2010-05-17 21:20:21


On May 14, 2010, at 12:24 PM, Josh Hursey wrote:

>
> On May 12, 2010, at 1:07 PM, Abhishek Kulkarni wrote:
>
>> Updated RFC (w/ discussed changes):
>>
>> ======================================================================
>> [RFC 2/2] merge the OPAL SOS development branch into trunk
>> ======================================================================
>>
>> WHAT: Merge the OPAL SOS development branch into the OMPI trunk.
>>
>> WHY: Bring over some of the work done to enhance error reporting
>> capabilities.
>>
>> WHERE: opal/util/ and a few changes in the ORTE notifier.
>>
>> TIMEOUT: May 17, Monday, COB.
>>
>> REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/
>>
>> ======================================================================
>>
>> BACKGROUND:
>>
>> The OPAL SOS framework tries to meet the following objectives:
>>
>> - Reduce the cascading error messages and the amount of code needed
>> to print an error message.
>> - Build and aggregate stacks of encountered errors and associate
>> related individual errors with each other.
>> - Allow registration of custom callbacks to intercept error events.
>>
>> The SOS system provides an interface to log events of varying
>> severities. These events are associated with an "encoded" error code
>> which can be used to refer to stacks of SOS events. When logging
>> events, they can also be transparently relayed to all the activated
>> notifier components.
>>
>> The SOS system is described in detail on this wiki page:
>>
>> http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>> https://svn.open-mpi.org/trac/ompi/attachment/wiki/ErrorMessages/OPAL_SOS.pdf
>>
>> CHANGES (since the last RFC):
>>
>> * Wrapped all hard-coded error-code checks (OMPI_ERR_* == ret) with
>> OPAL_SOS_GET_ERR_CODE(ret). There were about 30-40 such checks
>> each in the OMPI and ORTE layers and about 15 in the OPAL layer.
>> Since OPAL_SUCCESS is preserved by SOS, also changed calls of
>> the form (OPAL_SUCCESS != ret) to (OPAL_ERROR == ret).
>
> You mean the other way around, right?
> You changed code that previously looked like (OPAL_ERROR == ret) to
> (OPAL_SUCCESS != ret) where appropriate.
>

Yes, thanks for the correction! This (and ORTE WDC) is all in trunk
now -- I've split the changes into smaller patches (see commits r23155
- r23164) so that they are easier to sift through.
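
To illustrate why the checks need wrapping, here is a minimal sketch.
It does not use the real SOS headers: the mock_* names, the MOCK_*
constants and the bit layout are all assumptions made up for this
example, standing in for OPAL_SOS_GET_ERR_CODE and the OPAL error
codes. Only the idea comes from the RFC: success is preserved as-is,
while an encoded error no longer compares equal to its native code.

```c
/* Stand-in constants for this sketch; not copied from the OPAL headers. */
enum { MOCK_SUCCESS = 0, MOCK_ERROR = -1, MOCK_ERR_OUT_OF_RESOURCE = -2 };

/* Hypothetical encoding: the low 8 bits of the magnitude hold the native
 * error code, the remaining bits hold an index into a table of events. */
static int mock_sos_encode(int code, int event_index)
{
    if (MOCK_SUCCESS == code) {
        return MOCK_SUCCESS;    /* success is never encoded */
    }
    return -((event_index << 8) | (-code));
}

/* Recover the native error code; native codes pass through unchanged. */
static int mock_sos_get_err_code(int encoded)
{
    return -((-encoded) & 0xFF);
}

/* Old style: a direct comparison misses an encoded value. */
static int is_out_of_resource_direct(int ret)
{
    return MOCK_ERR_OUT_OF_RESOURCE == ret;
}

/* New style: decode first, then compare against the native code. */
static int is_out_of_resource_wrapped(int ret)
{
    return MOCK_ERR_OUT_OF_RESOURCE == mock_sos_get_err_code(ret);
}
```

With this layout, a generic (MOCK_SUCCESS != ret) test keeps working
on encoded values without any decoding, whereas a test for one
specific error has to go through the decoding helper.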

Abhishek

>>
>> * If the error is an SOS-encoded error, ORTE_ERROR_LOG decodes
>> the error, prints out the error stack and frees the errors.
>>
>> ======================================================================
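
The "print out the error stack and free the errors" step in the last
bullet could look roughly like the sketch below. The mock_event record
and both functions are hypothetical, invented for illustration; the
real SOS data structures and the ORTE_ERROR_LOG internals are not
shown in this thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical event record for this sketch. */
struct mock_event {
    int code;                   /* native error code of this event */
    const char *msg;            /* human-readable description */
    struct mock_event *prev;    /* event logged before this one, or NULL */
};

/* Push a new event on top of the stack and return the new top. */
static struct mock_event *mock_log(struct mock_event *top, int code,
                                   const char *msg)
{
    struct mock_event *ev = malloc(sizeof(*ev));
    if (NULL == ev) {
        return top;             /* out of memory: leave the stack as-is */
    }
    ev->code = code;
    ev->msg = msg;
    ev->prev = top;
    return ev;
}

/* Print the stack from newest to oldest, free every event, and return
 * the number of events that were reported. */
static int mock_report_and_free(struct mock_event *top, FILE *out)
{
    int n = 0;
    while (NULL != top) {
        struct mock_event *prev = top->prev;
        fprintf(out, "[%d] error %d: %s\n", n, top->code, top->msg);
        free(top);
        top = prev;
        n++;
    }
    return n;
}
```

Reporting and freeing in a single pass means the caller cannot reuse
the stack afterwards, which matches the "prints, then frees" wording
above but is only one possible design.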
>>
>>
>> On Mar 29, 2010, at 10:58 AM, Abhishek Kulkarni wrote:
>>
>>>
>>> ======================================================================
>>> [RFC 2/2]
>>> ======================================================================
>>>
>>> WHAT: Merge the OPAL SOS development branch into the OMPI trunk.
>>>
>>> WHY: Bring over some of the work done to enhance error reporting
>>> capabilities.
>>>
>>> WHERE: opal/util/ and a few changes in the ORTE notifier.
>>>
>>> TIMEOUT: April 6, Wednesday, COB.
>>>
>>> REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/
>>>
>>> ======================================================================
>>>
>>> BACKGROUND:
>>>
>>> The OPAL SOS framework tries to meet the following objectives:
>>>
>>> - Reduce the cascading error messages and the amount of code needed
>>> to print an error message.
>>> - Build and aggregate stacks of encountered errors and associate
>>> related individual errors with each other.
>>> - Allow registration of custom callbacks to intercept error events.
>>>
>>> The SOS system provides an interface to log events of varying
>>> severities. These events are associated with an "encoded" error code
>>> which can be used to refer to stacks of SOS events. When logging
>>> events, they can also be transparently relayed to all the activated
>>> notifier components.
>>>
>>> The SOS system is described in detail on this wiki page:
>>>
>>> http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>>>
>>> Feel free to comment and/or provide suggestions.
>>>
>>> ======================================================================
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel