Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-10-19 18:41:51

A careful reading of the committed patch would have shown that none of the concerns raised so far were true: the "old-way" behavior of the OMPI code was preserved. Moreover, every single one of the removed error codes had not been used in ages.

What Brian pointed out as evil (evil being a subjective notion in itself) neither prevented the code from behaving correctly nor affected its correctness in any way. Anyway, to address his concern I pushed a patch (25333) putting the OMPI error codes back where they were originally.

In other words, we spent a very unproductive day arguing over unfounded claims and "thought-to-be" behaviors.


On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:

> George -
> I wrote the error code gorp; I'm pretty sure I know exactly how it was
> supposed to work.
> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
> OPAL_ERR_MAX. I now see what you did with ERR_REQUEST, and it's evil.
> That's not the intent of the error code logic at all. If you want to
> change that, I'm not necessarily opposed to it, but that's something that
> should be discussed in an RFC. What the current code does is not
> consistent with the original intent.
> I don't agree that you shouldn't propagate error codes through OMPI; in
> fact, the original intent of the design was to allow such propagation.
> Again, such a change should be discussed as part of an RFC.
> Brian
> On 10/19/11 4:50 PM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>> I don't know how you think that the error codes work in Open MPI, so I'll
>> take the liberty to depict it here so we all agree we're talking about
>> the same thing.
>> The opal_strerror is a nice feature: it allows registering a range of
>> error codes with a particular error converter. Every time you look for
>> the meaning of a particular error code, the first converter with a range
>> enveloping the looked-up value will translate it into an error string.
>> This is only currently used by OPAL and ORTE directly. It worked at the
>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>> ones. This behavior didn't change after my patch, you can still use
>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>> There is a small "variation" for OMPI_ERR_REQUEST, the only truly
>> OMPI-specific error code today. The OMPI error codes are actually inserted
>> between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>> there is __no__ possible overlap right now. If at some point we add more
>> than 100 OMPI-level codes, we should certainly revisit this.
>> Now, as a result of my patch, there is a difference. One should not
>> simply forward an ORTE code into the OMPI stack and hope it just
>> works. Errors should be dealt with where they happen, and if that is not
>> possible they should be translated into an error code of the current
>> layer. Error propagation should be compartmentalized: an error has to be
>> translated into a code that has a meaning at the OMPI level. The current
>> patch should not prevent code with mixed error codes from working, as
>> opal_strerror retains the same behavior as before. However, this coding
>> practice should be avoided. I tried to clean the current code of such
>> instances a few days ago in r25230.
>> Moreover, this is similar to how we deal with the error codes between
>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>> deal with a specific layer error code when you get it (usually after the
>> call to a function from that specific layer), not later on when you don't
>> even know exactly what the execution path was.
>> george.
>> PS: I'll fix the +/- issue.
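
The range-based lookup described above for opal_strerror can be pictured roughly as follows. This is a minimal sketch in C, not the actual Open MPI implementation: every name (demo_register, demo_strerror, the two converter functions) and every numeric range is a hypothetical stand-in chosen for illustration. It assumes error codes are negative and that each range is open on both ends.

```c
#include <stddef.h>

/* All names and ranges below are hypothetical; this is not Open MPI code. */

typedef const char *(*demo_converter_fn)(int code);

typedef struct {
    int err_base;               /* exclusive upper bound; codes grow downward */
    int err_max;                /* exclusive lower bound */
    demo_converter_fn convert;  /* turns a code in the range into a string */
} demo_converter_t;

#define DEMO_MAX_CONVERTERS 8

static demo_converter_t demo_converters[DEMO_MAX_CONVERTERS];
static size_t demo_num_converters = 0;

/* Register a converter for the open range (err_max, err_base).
 * Returns 0 on success, -1 if the arguments are invalid or the table is full. */
int demo_register(int err_base, int err_max, demo_converter_fn fn)
{
    if (fn == NULL || err_max >= err_base ||
        demo_num_converters == DEMO_MAX_CONVERTERS) {
        return -1;
    }
    demo_converters[demo_num_converters].err_base = err_base;
    demo_converters[demo_num_converters].err_max = err_max;
    demo_converters[demo_num_converters].convert = fn;
    demo_num_converters++;
    return 0;
}

/* The first registered converter whose range encloses the code wins. */
const char *demo_strerror(int code)
{
    for (size_t i = 0; i < demo_num_converters; ++i) {
        const demo_converter_t *c = &demo_converters[i];
        if (code < c->err_base && code > c->err_max) {
            return c->convert(code);
        }
    }
    return "unknown error";
}

/* Two toy converters standing in for two layers of a library stack. */
const char *demo_lower_layer_str(int code)
{
    return (code == -1) ? "generic error" : "lower-layer error";
}

const char *demo_upper_layer_str(int code)
{
    return (code == -101) ? "bad request" : "upper-layer error";
}
```

With the lower layer registered on (-100, 0) and the upper layer on (-200, -100), a call such as demo_strerror(-101) is routed to the upper-layer converter, while a code outside every registered range falls through to the generic string. As long as the ranges do not overlap, the lookup stays unambiguous no matter which layer produced the code.
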
>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>> and -).
>>> For one thing, that breaks opal_strerror(). That, in itself, seems
>>> like a dealbreaker.
>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>>> I actually think it's worse than that. An ORTE error can now have
>>>> the same error code as an OMPI error. OMPI_ERR_REQUEST and
>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>> Or, they should, if George hadn't made a mistake (see below). The sharing
>>>> of return codes seems... bad.
>>>> Also, there's a bug in George's patch. Error codes are all negative, so
>>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should be
>>>> OMPI_ERR_BASE - 1, not plus 2.
>>>> Brian
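
The sign issue raised above can be checked with a small sketch. The layout below is hypothetical (two stacked ranges of 100 negative codes each, with made-up DEMO_ names) and is not taken from the Open MPI headers. It only illustrates why, when codes grow downward from a base, "base minus an offset" stays inside a layer's range while "base plus an offset" lands in the range above it.

```c
#include <stdbool.h>

/* Hypothetical layout for illustration only; the real constants differ. */
enum {
    DEMO_LOWER_BASE = 0,
    DEMO_LOWER_MAX  = DEMO_LOWER_BASE - 100,
    DEMO_UPPER_BASE = DEMO_LOWER_MAX,        /* upper layer starts where lower ends */
    DEMO_UPPER_MAX  = DEMO_UPPER_BASE - 100
};

/* True when code lies strictly inside the open range (max, base). */
bool demo_owns(int base, int max, int code)
{
    return code < base && code > max;
}
```

Here demo_owns(DEMO_UPPER_BASE, DEMO_UPPER_MAX, DEMO_UPPER_BASE - 1) holds, whereas DEMO_UPPER_BASE + 2 is owned by the lower range instead, which is the kind of accidental collision the message warns about.
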
>>>> On 10/19/11 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>>> the right answer. So please consider this a general design question for
>>>>> the community.
>>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>>> constant. I understand the thinking (or at least, what I suspect was the
>>>>> thought), but it creates an issue.
>>>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>>>> function (B). Invisible to A is that B calls an orte-level function. B
>>>>> dutifully checks the error return from the orte-level function against an
>>>>> ORTE-prefixed constant.
>>>>> However, if that return isn't "success", what does B return up to A? It
>>>>> cannot return the OMPI equivalent to the orte error constant because it
>>>>> no longer exists. It could return the orte error code, but A has no way
>>>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>>>> able to understand it - it will be an "unrecognized error".
>>>>> I guess one option is to require that B "translate" the return code and
>>>>> pass some OMPI error up the chain, but this prevents anything upwards
>>>>> from understanding the nature of the problem and potentially taking
>>>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>>>> the time the only option will be the vanilla "OMPI_ERROR".
>>>>> Thoughts?
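
The "translate at the boundary" option in the A/B scenario above might look like the following sketch. All identifiers (the LOW_ and UP_ codes, low_do_work, up_from_low, up_function_a, up_function_b) are hypothetical and exist only for this illustration; none of them are Open MPI symbols. The default branch shows the loss of detail the message is worried about: any lower-layer code without a dedicated upper-layer equivalent collapses into the generic error.

```c
/* Hypothetical error codes for two layers; not Open MPI constants. */
enum { LOW_SUCCESS = 0, LOW_ERROR = -1, LOW_ERR_OUT_OF_RESOURCE = -2, LOW_ERR_TIMEOUT = -3 };
enum { UP_SUCCESS = 0, UP_ERROR = -101, UP_ERR_OUT_OF_RESOURCE = -102, UP_ERR_TIMEOUT = -103 };

/* Map a lower-layer return code onto the upper layer's own codes. */
int up_from_low(int low_rc)
{
    switch (low_rc) {
    case LOW_SUCCESS:             return UP_SUCCESS;
    case LOW_ERR_OUT_OF_RESOURCE: return UP_ERR_OUT_OF_RESOURCE;
    case LOW_ERR_TIMEOUT:         return UP_ERR_TIMEOUT;
    default:                      return UP_ERROR;  /* detail is lost here */
    }
}

/* Stand-in for a lower-layer call; it simply returns the code it is given. */
int low_do_work(int simulated_rc)
{
    return simulated_rc;
}

/* B: calls into the lower layer and translates before returning. */
int up_function_b(int simulated_rc)
{
    int rc = low_do_work(simulated_rc);
    if (LOW_SUCCESS != rc) {
        return up_from_low(rc);
    }
    return UP_SUCCESS;
}

/* A: only ever sees upper-layer codes and never learns that B used the lower layer. */
int up_function_a(int simulated_rc)
{
    return up_function_b(simulated_rc);
}
```

In this sketch up_function_a(LOW_ERR_TIMEOUT) yields UP_ERR_TIMEOUT, so A can still react to a timeout, but an unmapped lower-layer code reaches A only as UP_ERROR. How many dedicated equivalents to keep is exactly the design trade-off under discussion in the thread.
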
>>>> --
>>>> Brian W. Barrett
>>>> Dept. 1423: Scalable System Software
>>>> Sandia National Laboratories
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to: