Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-10-19 18:48:30


There are several OPAL level error codes not used in the current code.

OPAL_ERR_TOPO_SLOT_LIST_NOT_SUPPORTED
OPAL_ERR_TOPO_SOCKET_NOT_SUPPORTED
OPAL_ERR_TOPO_CORE_NOT_SUPPORTED
OPAL_ERR_NOT_ENOUGH_SOCKETS
OPAL_ERR_NOT_ENOUGH_CORES
OPAL_ERR_INVALID_PHYS_CPU
OPAL_ERR_MULTIPLE_AFFINITIES

If somebody feels like filing an RFC to remove them, please feel free to go ahead.

  george.

On Oct 19, 2011, at 18:41 , George Bosilca wrote:

> A careful reading of the committed patch would have shown that none of the concerns raised so far were true: the "old-way" behavior of the OMPI code was preserved. Moreover, every single one of the removed error codes had not been used in ages.
>
> What Brian pointed out as evil (evil being a subjective notion in itself) neither prevented the correct behavior of the code nor affected its correctness in any way. Anyway, to address his concern I pushed a patch (25333) putting the OMPI error codes back where they were originally.
>
> In other words, we spent a very unproductive day arguing over unfounded concerns and "thought-to-be" behaviors.
>
> george.
>
>
> On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:
>
>> George -
>>
>> I wrote the error code gorp; I'm pretty sure I know exactly how it was
>> supposed to work.
>>
>> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
>> OPAL_ERR_MAX. I now see what you did with ERR_REQUEST, and it's evil.
>> That's not the intent of the error code logic at all. If you want to
>> change that, I'm not necessarily opposed to it, but that's something that
>> should be discussed in an RFC. What the current code does is not
>> consistent with the original intent.
>>
>> I don't agree that you shouldn't propagate error codes through OMPI; in
>> fact, the original intent of the design was to allow such propagation.
>> Again, such a change should be discussed as part of an RFC.
>>
>> Brian
>>
>> On 10/19/11 4:50 PM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>>
>>> I don't know how you think that the error codes work in Open MPI, so I'll
>>> take the liberty to depict it here so we all agree we're talking about
>>> the same thing.
>>>
>>> The opal_strerror is a nice feature: it allows one to register a range of
>>> error codes with a particular error converter. Every time you look up
>>> the meaning of a particular error code, the first converter with a range
>>> enclosing the looked-up value will translate it into an error string.
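
The range-based lookup described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual opal_strerror implementation: the names register_range, lookup_error, err2str_fn and register_demo_layers, as well as the numeric ranges, are hypothetical, and a range is assumed to cover the codes in (max, base].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical converter type: returns a string for a code, or NULL. */
typedef const char *(*err2str_fn)(int code);

struct err_range {
    int base;           /* highest (least negative) code in the range */
    int max;            /* lower bound of the range, exclusive */
    err2str_fn convert; /* converter responsible for this range */
};

#define MAX_RANGES 8
static struct err_range ranges[MAX_RANGES];
static size_t num_ranges = 0;

/* Register a converter for the codes in (max, base]; returns 0 on success. */
int register_range(int base, int max, err2str_fn convert)
{
    if (num_ranges == MAX_RANGES || convert == NULL || max >= base) {
        return -1;
    }
    ranges[num_ranges].base = base;
    ranges[num_ranges].max = max;
    ranges[num_ranges].convert = convert;
    num_ranges++;
    return 0;
}

/* The first registered range that encloses the code translates it. */
const char *lookup_error(int code)
{
    for (size_t i = 0; i < num_ranges; i++) {
        if (code <= ranges[i].base && code > ranges[i].max) {
            const char *s = ranges[i].convert(code);
            return s != NULL ? s : "unknown error in registered range";
        }
    }
    return "unrecognized error";
}

/* Two made-up layers, each with a single known code. */
static const char *lower_layer_str(int code)
{
    return code == -1 ? "lower: generic error" : NULL;
}

static const char *upper_layer_str(int code)
{
    return code == -101 ? "upper: bad request" : NULL;
}

/* Register both layers on adjacent, non-overlapping ranges. */
void register_demo_layers(void)
{
    int rc = register_range(0, -100, lower_layer_str);
    assert(rc == 0);
    rc = register_range(-100, -200, upper_layer_str);
    assert(rc == 0);
    (void)rc;
}
```

After register_demo_layers() has run, lookup_error(-101) is answered by the second converter, while a code outside every registered range falls through to "unrecognized error".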
>>>
>>> This is only currently used by OPAL and ORTE directly. It worked at the
>>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>>> ones. This behavior didn't change after my patch, you can still use
>>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>>>
>>> There is a small "variation" for OMPI_ERR_REQUEST, the only really
>>> OMPI-specific error code today. The OMPI error codes are actually inserted
>>> between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>>> there is __no__ possible overlap right now. If at some point we add more
>>> than 100 OMPI-level error codes, we should certainly revisit this.
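
To make the "gap of 100 elements" argument concrete, here is a small sketch of three adjacent negative ranges and a check that every code has exactly one owner. All DEMO_ constants and their values are assumptions chosen for illustration; the real values live in the Open MPI headers and may differ. Ranges are again assumed to cover the codes in (max, base].

```c
#include <assert.h>

/* Illustrative layout only: three layers, 100 codes each, growing downward. */
enum {
    DEMO_OPAL_ERR_BASE = 0,
    DEMO_OPAL_ERR_MAX  = DEMO_OPAL_ERR_BASE - 100,
    DEMO_OMPI_ERR_BASE = DEMO_OPAL_ERR_MAX,        /* OMPI sits in the gap */
    DEMO_OMPI_ERR_MAX  = DEMO_OMPI_ERR_BASE - 100,
    DEMO_ORTE_ERR_BASE = DEMO_OMPI_ERR_MAX,
    DEMO_ORTE_ERR_MAX  = DEMO_ORTE_ERR_BASE - 100
};

/* A range covers the codes in (max, base]. */
int in_range(int code, int base, int max)
{
    assert(max < base); /* a range must not be empty */
    return code <= base && code > max;
}

/* Number of layers that claim a given code. */
int owners(int code)
{
    return in_range(code, DEMO_OPAL_ERR_BASE, DEMO_OPAL_ERR_MAX)
         + in_range(code, DEMO_OMPI_ERR_BASE, DEMO_OMPI_ERR_MAX)
         + in_range(code, DEMO_ORTE_ERR_BASE, DEMO_ORTE_ERR_MAX);
}

/* Returns 1 if every code in the combined span has exactly one owner. */
int layout_is_disjoint(void)
{
    for (int code = DEMO_OPAL_ERR_BASE; code > DEMO_ORTE_ERR_MAX; code--) {
        if (owners(code) != 1) {
            return 0;
        }
    }
    return 1;
}
```

With this layout a lookup by range is unambiguous; adding a 101st code to the middle layer would push it past its own max and into the next layer's range, which is the case the paragraph above says would need revisiting.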
>>>
>>> Now, as a result of my patch, there is a difference. One should not
>>> simply forward an ORTE code into the stack of OMPI and hope it just
>>> works. Errors should be dealt with where they happen, and if that is not
>>> possible they should be translated into the error code of the actual
>>> layer. The error propagation should be compartmentalized, and each error
>>> has to be translated into an error code that has a meaning at the OMPI
>>> level. The current patch should not prevent code that mixes error codes
>>> from working, as opal_strerror retains the same behavior as before.
>>> However, this coding practice should be avoided. I tried to clean the
>>> current code of such instances a few days ago in r25230.
>>>
>>> Moreover, this is similar to how we deal with the error codes between
>>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>>> deal with a specific layer error code when you get it (usually after the
>>> call to a function from that specific layer), not later on when you don't
>>> even know exactly what the execution path was.
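
One way to read the "translate at the layer boundary" advice is sketched below. It is an assumption-laden illustration, not code from the patch: the RT_ and UP_ constants, their values, and the functions rt_do_work, translate_rt_error and up_function_b are all hypothetical stand-ins for a runtime layer and the layer above it.

```c
#include <assert.h>

/* Hypothetical lower-layer (runtime) return codes. */
enum {
    RT_SUCCESS = 0,
    RT_ERR_TIMEOUT = -201,
    RT_ERR_OUT_OF_RESOURCE = -202,
    RT_ERR_INTERNAL = -203
};

/* Hypothetical upper-layer return codes. */
enum {
    UP_SUCCESS = 0,
    UP_ERROR = -101,
    UP_ERR_TIMEOUT = -102,
    UP_ERR_OUT_OF_RESOURCE = -103
};

/* Map a lower-layer code to the upper layer's vocabulary.
 * Codes with an upper-layer equivalent keep their meaning;
 * everything else collapses to the generic UP_ERROR. */
int translate_rt_error(int rt_rc)
{
    switch (rt_rc) {
    case RT_SUCCESS:
        return UP_SUCCESS;
    case RT_ERR_TIMEOUT:
        return UP_ERR_TIMEOUT;
    case RT_ERR_OUT_OF_RESOURCE:
        return UP_ERR_OUT_OF_RESOURCE;
    default:
        assert(rt_rc < 0); /* errors are negative in this sketch */
        return UP_ERROR;
    }
}

/* Stand-in for a runtime-layer call; returns whatever it is told to. */
static int rt_do_work(int simulated_rc)
{
    return simulated_rc;
}

/* Upper-layer function: checks the lower-layer result right after the
 * call and never lets a raw RT_ code escape to its own callers. */
int up_function_b(int simulated_rc)
{
    int rc = rt_do_work(simulated_rc);
    if (RT_SUCCESS != rc) {
        return translate_rt_error(rc);
    }
    return UP_SUCCESS;
}
```

In this sketch a caller of up_function_b only ever sees UP_ codes. How much detail survives the boundary depends on how many specific equivalents the upper layer chooses to define; anything without one ends up as the generic error.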
>>>
>>> george.
>>>
>>> PS: I'll fix the +/- issue.
>>>
>>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>>>
>>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>>> and -).
>>>>
>>>> For one thing, that breaks opal_strerror(). That, in itself, seems
>>>> like a dealbreaker.
>>>>
>>>>
>>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>>>
>>>>> I actually think it's worse than that. An ORTE error can now have
>>>>> the same integer value as an OMPI error. OMPI_ERR_REQUEST and
>>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>>> Or, they should, if George hadn't made a mistake (see below). The
>>>>> sharing of return codes seems... bad.
>>>>>
>>>>> Also, there's a bug in George's patch. Error codes are all negative, so
>>>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should be
>>>>> OMPI_ERR_BASE - 1, not plus 2.
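
The sign issue can be illustrated with a tiny sketch. The base value, the range width, and the DEMO_ names are assumptions for illustration and are not taken from the patch; the point is only that, when codes grow toward negative infinity, subtracting from the base stays inside the layer's range while adding to it lands in the range of the layer above.

```c
#include <assert.h>

/* Hypothetical layer occupying the codes in (max, base] = (-200, -100]. */
#define DEMO_LAYER_BASE (-100)
#define DEMO_LAYER_MAX  (DEMO_LAYER_BASE - 100)

enum {
    DEMO_ERR_MINUS = DEMO_LAYER_BASE - 1, /* -101: inside the layer's range */
    DEMO_ERR_PLUS  = DEMO_LAYER_BASE + 2  /* -98: outside, in the previous range */
};

/* Returns 1 if the code lies in (max, base]. */
int in_layer(int code, int base, int max)
{
    assert(max < base); /* ranges extend downward from base */
    return code <= base && code > max;
}
```

Here in_layer(DEMO_ERR_MINUS, DEMO_LAYER_BASE, DEMO_LAYER_MAX) is 1, whereas the same check for DEMO_ERR_PLUS is 0, so a range-based lookup would hand that code to a different layer's converter.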
>>>>>
>>>>> Brian
>>>>>
>>>>> On 10/19/11 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>
>>>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>>>> the right answer. So please consider this a general design question
>>>>>> for the community.
>>>>>>
>>>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e.,
>>>>>> we used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>>>> constant. I understand the thinking (or at least, what I suspect was
>>>>>> the thought), but it creates an issue.
>>>>>>
>>>>>> Suppose I have an ompi-level function (A) that calls another
>>>>>> ompi-level function (B). Invisible to A is that B calls an orte-level
>>>>>> function. B dutifully checks the error return from the orte-level
>>>>>> function against an ORTE-prefixed constant.
>>>>>>
>>>>>> However, if that return isn't "success", what does B return up to A?
>>>>>> It cannot return the OMPI equivalent to the orte error constant
>>>>>> because it no longer exists. It could return the orte error code, but
>>>>>> A has no way of knowing it is going to get a non-OMPI constant, and
>>>>>> therefore won't be able to understand it - it will be an
>>>>>> "unrecognized error".
>>>>>>
>>>>>> I guess one option is to require that B "translate" the return code
>>>>>> and pass some OMPI error up the chain, but this prevents anything
>>>>>> upwards from understanding the nature of the problem and potentially
>>>>>> taking corrective and/or alternative action. Seems awfully limiting,
>>>>>> as most of the time the only option will be the vanilla "OMPI_ERROR".
>>>>>>
>>>>>> Thoughts?
>>>>> --
>>>>> Brian W. Barrett
>>>>> Dept. 1423: Scalable System Software
>>>>> Sandia National Laboratories
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/