On Oct 19, 2011, at 2:50 PM, George Bosilca wrote:
> I don't know how you think the error codes work in Open MPI, so I'll take the liberty of describing it here so we all agree we're talking about the same thing.
> The opal_strerror is a nice feature: it allows registering a range of error codes with a particular error converter. Every time you look up the meaning of a particular error code, the first converter whose range encloses that value will translate it into an error string.
> This is only currently used by OPAL and ORTE directly. It worked at the OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE ones. This behavior didn't change after my patch, you can still use opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
> There is a small "variation" for OMPI_ERR_REQUEST, the only truly OMPI-specific error code today. The OMPI error codes are actually inserted between the OPAL and the ORTE ones (there is a gap of 100 elements), so there is __no__ possible overlap right now. If at some point we add more than 100 OMPI-level error codes, we should certainly revisit this.
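To make the range-registration idea above concrete, here is a minimal sketch in C. It is not the real opal_strerror implementation: every name in it (register_converter, layered_strerror, the two sample converters) and every numeric range is a hypothetical stand-in chosen for illustration. It only shows the lookup rule described above, where the first converter whose range encloses the code produces the string, plus a check that refuses overlapping ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT the Open MPI API. */
typedef const char *(*err_converter_fn)(int errnum);

typedef struct {
    int first;              /* highest code in the range (codes are negative) */
    int last;               /* lowest code in the range, inclusive */
    err_converter_fn convert;
} err_range_t;

#define MAX_CONVERTERS 8
static err_range_t converters[MAX_CONVERTERS];
static size_t num_converters = 0;

/* Register a converter for the codes first, first - 1, ..., last.
 * Returns 0 on success, -1 on a bad range, a full table, or an overlap. */
int register_converter(int first, int last, err_converter_fn fn)
{
    if (NULL == fn || last > first || MAX_CONVERTERS == num_converters) {
        return -1;
    }
    for (size_t i = 0; i < num_converters; ++i) {
        /* Two inclusive ranges overlap if each starts before the other ends. */
        if (last <= converters[i].first && converters[i].last <= first) {
            return -1;
        }
    }
    converters[num_converters++] = (err_range_t){ first, last, fn };
    return 0;
}

/* Return the string from the first converter whose range encloses errnum. */
const char *layered_strerror(int errnum)
{
    for (size_t i = 0; i < num_converters; ++i) {
        if (errnum <= converters[i].first && errnum >= converters[i].last) {
            return converters[i].convert(errnum);
        }
    }
    return "unknown error";
}

/* Sample converter for a made-up lower layer owning codes -1 .. -100. */
const char *lower_layer_str(int errnum)
{
    switch (errnum) {
    case -1: return "lower: generic error";
    case -2: return "lower: out of resource";
    default: return "lower: unspecified error";
    }
}

/* Sample converter for a made-up upper layer owning codes -101 .. -200. */
const char *upper_layer_str(int errnum)
{
    return (-101 == errnum) ? "upper: bad request" : "upper: unspecified error";
}
```

With the two sample converters registered on -1..-100 and -101..-200, layered_strerror(-2) is answered by the lower-layer converter and layered_strerror(-101) by the upper-layer one. A third registration that straddles either range is rejected, which is the "no possible overlap" property.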
> Now, as a result of my patch, there is a difference. One should not simply forward an ORTE code up the OMPI stack and hope it just works. Errors should be dealt with where they happen, and if that is not possible they should be translated into an error code of the current layer. Error propagation should be compartmentalized: an error has to be translated into a code that has a meaning at the OMPI level. The current patch should not prevent code that mixes error codes from working, as opal_strerror retains the same behavior as before. However, this coding practice should be avoided. I tried to clean the current code of such instances a few days ago in r25230.
> Moreover, this is similar to how we deal with the error codes between OMPI and MPI layers, and seems like a sane way to compose libraries. You deal with a specific layer error code when you get it (usually after the call to a function from that specific layer), not later on when you don't even know exactly what the execution path was.
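The translate-at-the-boundary practice described above could be sketched as follows. This is only an illustration under stated assumptions: the LOW_* and HIGH_* constants, their values, and the functions low_send, high_b and high_a are all hypothetical and do not correspond to actual ORTE or OMPI symbols. The point is that the function that calls into the lower layer checks the lower-layer code immediately and returns only upper-layer codes, while still keeping enough distinct upper-layer codes for a caller to take corrective action.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lower-layer codes (stand-ins, not real ORTE constants). */
enum { LOW_SUCCESS = 0, LOW_ERR_COMM_FAILURE = -1, LOW_ERR_TIMEOUT = -2 };

/* Hypothetical upper-layer codes (stand-ins, not real OMPI constants). */
enum {
    HIGH_SUCCESS     = 0,
    HIGH_ERROR       = -101,   /* generic fallback */
    HIGH_ERR_UNREACH = -102,
    HIGH_ERR_TIMEOUT = -103
};

/* Translate a lower-layer code into the upper layer's vocabulary. */
int high_from_low(int low_rc)
{
    switch (low_rc) {
    case LOW_SUCCESS:          return HIGH_SUCCESS;
    case LOW_ERR_COMM_FAILURE: return HIGH_ERR_UNREACH;
    case LOW_ERR_TIMEOUT:      return HIGH_ERR_TIMEOUT;
    default:                   return HIGH_ERROR;
    }
}

/* Stand-in for a lower-layer call; it just returns the code it is given. */
int low_send(int simulated_rc)
{
    return simulated_rc;
}

/* "B": calls the lower layer, checks its code right away, translates. */
int high_b(int simulated_low_rc)
{
    int rc = low_send(simulated_low_rc);
    if (LOW_SUCCESS != rc) {
        return high_from_low(rc);
    }
    return HIGH_SUCCESS;
}

/* "A": sees only upper-layer codes, yet can still react to a specific one. */
int high_a(int simulated_low_rc, int *used_fallback)
{
    int rc = high_b(simulated_low_rc);
    assert(NULL != used_fallback);
    *used_fallback = 0;
    if (HIGH_ERR_UNREACH == rc) {
        *used_fallback = 1;    /* e.g. switch to another component */
        return HIGH_SUCCESS;
    }
    return rc;
}
```

In this sketch high_a never needs to know which lower layer high_b used. Whether a specific code such as HIGH_ERR_UNREACH exists, or everything collapses into the generic HIGH_ERROR, is a design decision for the upper layer.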
I'll have to ponder your logic. Not saying I disagree, but it would have been much nicer if you had explained your intended purpose in an RFC before imposing such a philosophy.
We were passing error codes up the ladder to allow higher levels to take corrective action that might extend beyond the scope of the immediate code - e.g., it might lead someone to use a specific different component in the framework if they knew that the RML was no longer working. We have lost that ability now, though we can regain it by defining OMPI error codes that don't equate to ORTE values, but retain the same meaning - and then translating as required. Not sure what that buys us, but maybe it will make some people feel better.
> PS: I'll fix the +/- issue.
> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error codes. That seems like a very bad idea (in addition to the mixing of + and -).
>> For one thing, that breaks opal_strerror(). That, in itself, seems like a dealbreaker.
>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>> I actually think it's worse than that. An ORTE error code can now have
>>> the same error code as an OMPI error. OMPI_ERR_REQUEST and
>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>> Or, they should, if George hadn't made a mistake (see below). The sharing
>>> of return codes seems... bad.
>>> Also, there's a bug in George's patch. Error codes are all negative, so
>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>>> OMPI_ERR_BASE - 1, not plus 2.
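The sign issue raised above can be shown with a small sketch. The DEMO_* constants and their values are hypothetical and are not the actual Open MPI definitions; the only assumption taken from the thread is that error codes are negative and that one layer's range begins where the previous layer's range ends. Under that assumption, adding to the base lands back inside the previous layer's range, while subtracting moves into fresh territory.

```c
#include <assert.h>

/* Hypothetical layout: the lower layer owns the codes strictly between
 * DEMO_LOWER_ERR_BASE and DEMO_LOWER_ERR_MAX, and the upper layer starts
 * where the lower one ends. None of these are real Open MPI values. */
enum {
    DEMO_LOWER_ERR_BASE = 0,
    DEMO_LOWER_ERR_MAX  = DEMO_LOWER_ERR_BASE - 100,
    DEMO_UPPER_ERR_BASE = DEMO_LOWER_ERR_MAX,

    /* Correct: codes grow downward, so offsets are subtracted. */
    DEMO_UPPER_ERR_REQUEST_OK  = DEMO_UPPER_ERR_BASE - 1,

    /* Wrong: adding the offset re-enters the lower layer's range. */
    DEMO_UPPER_ERR_REQUEST_BAD = DEMO_UPPER_ERR_BASE + 1
};

/* Nonzero if code falls inside the lower layer's range. */
int demo_in_lower_range(int code)
{
    return code < DEMO_LOWER_ERR_BASE && code > DEMO_LOWER_ERR_MAX;
}

/* Sanity check of the layout; aborts if the assumptions are violated. */
void demo_check_layout(void)
{
    assert(!demo_in_lower_range(DEMO_UPPER_ERR_REQUEST_OK));
    assert(demo_in_lower_range(DEMO_UPPER_ERR_REQUEST_BAD));
}
```

Here the "plus" variant evaluates to -99, which the lower layer already owns, whereas the "minus" variant evaluates to -101 and collides with nothing in the lower range.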
>>> On 10/19/11 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>> the right answer. So please consider this a general design question for
>>>> the community.
>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>> constant. I understand the thinking (or at least, what I suspect was the
>>>> thought), but it creates an issue.
>>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>>> function (B). Invisible to A is that B calls an orte-level function. B
>>>> dutifully checks the error return from the orte-level function against an
>>>> ORTE-prefixed constant.
>>>> However, if that return isn't "success", what does B return up to A? It
>>>> cannot return the OMPI equivalent to the orte error constant because it
>>>> no longer exists. It could return the orte error code, but A has no way
>>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>>> able to understand it - it will be an "unrecognized error".
>>>> I guess one option is to require that B "translate" the return code and
>>>> pass some OMPI error up the chain, but this prevents anything upwards
>>>> from understanding the nature of the problem and potentially taking
>>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>>> the time the only option will be the vanilla "OMPI_ERROR".
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> devel mailing list
>> Jeff Squyres