Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: MCA param registration errors
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2011-11-02 14:22:26


To be honest, I don't care so much today, I'm just fighting so that the
output doesn't get worse. At some point, we do need to figure out a
better way of dealing with error messages, but not today :).

Brian

On 11/2/11 11:53 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:

>Hmmm....since it was my bug that surfaced the problem, maybe the best
>answer is to just return an error code. I'll slowly work thru the param
>registrations in ORTE and make them all check the return code. I'm
>willing to look at OPAL as I go, but someone else will have to deal with
>the OMPI layer.
>
>I don't know how to entirely avoid the message issue Brian mentions -
>I'll still have to say -something- when I get an error code, but I have
>come up with some methods for reducing the clutter.
>
>On Nov 2, 2011, at 11:43 AM, Barrett, Brian W wrote:
>
>> I really don't like our show_help at every level behavior (look at what
>> happens when MPI_INIT fails, you get a page per process of the same error
>> message from each level of the call stack). If you want to show_help and
>> abort on debug, that makes sense. It doesn't make any sense on a
>> production build. Return an error code and let the upper layer deal with
>> it.
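
The "return an error code and let the upper layer deal with it" approach
can be illustrated with a minimal C sketch: the lower layers stay silent
and only pass the code up, and a single top-level caller prints one
message. Every name here (toy_report, toy_low_layer_init,
toy_mid_layer_init, toy_top_level_init, TOY_SUCCESS, TOY_ERROR) is a
hypothetical stand-in, not an Open MPI function or constant.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes; not the real OPAL definitions. */
#define TOY_SUCCESS 0
#define TOY_ERROR  (-1)

/* Counts how many messages were printed, so the behavior can be checked. */
int toy_messages_printed = 0;

/* Stand-in for a help-message facility: print once and count it. */
static void toy_report(const char *layer, int rc)
{
    fprintf(stderr, "%s: initialization failed (rc=%d)\n", layer, rc);
    ++toy_messages_printed;
}

/* Lowest layer: fails on request, prints nothing. */
static int toy_low_layer_init(int fail)
{
    return fail ? TOY_ERROR : TOY_SUCCESS;
}

/* Middle layer: propagates the code unchanged, prints nothing. */
static int toy_mid_layer_init(int fail)
{
    return toy_low_layer_init(fail);
}

/* Top layer: the only place that reports the failure. */
int toy_top_level_init(int fail)
{
    int rc;

    assert(fail == 0 || fail == 1);
    rc = toy_mid_layer_init(fail);
    if (rc != TOY_SUCCESS) {
        toy_report("top", rc);
    }
    return rc;
}
```

With this shape, a failure three layers down produces one message instead
of one per layer; the cost is that the message comes from a layer with
less context about what went wrong.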
>>
>> Brian
>>
>> On 11/2/11 11:27 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>
>>> Brian: you were the one that had an allergic reaction to #1 on the
>>> call.
>>>
>>> Thoughts?
>>>
>>>
>>> On Nov 2, 2011, at 1:23 PM, George Bosilca wrote:
>>>
>>>> As has been said, this is not something that is supposed to make it into
>>>> a release. In the unfortunate case where it does, always having a
>>>> show_help will ensure a quick complaint on one of our mailing lists and
>>>> increase the probability of a [very] quick fix.
>>>>
>>>> george.
>>>>
>>>> On Nov 2, 2011, at 06:26 , TERRY DONTJE wrote:
>>>>
>>>>>
>>>>>
>>>>> On 11/1/2011 7:48 PM, Jeff Squyres wrote:
>>>>>> So this was slightly different than the opinion that was discussed on
>>>>>> the call today, which was 2. The rationale for #2 was to punish
>>>>>> developers, but if such a bug did make it through to production, users
>>>>>> wouldn't be annoyed with show_help messages all the time.
>>>>>>
>>>>>> Does anyone have strong opinions here? I don't.
>>>>>>
>>>>>> I offer the following two points:
>>>>>>
>>>>>> - this is a coding error on the part of the OMPI developer
>>>>>> - it's pretty rare
>>>>>>
>>>>>>
>>>>> I think a show_help + return is very helpful in this case. I wouldn't
>>>>> think that we'd run into this case that much, and it would seem that it
>>>>> would be a rare occurrence that one could just fix when they run into
>>>>> it. However, since there was some opposition to having show_help
>>>>> messages possibly coming up all over the place, I thought a fallback
>>>>> of only doing the show_help on enable_debug builds was a
>>>>> reasonable middle ground.
>>>>>
>>>>> --td
>>>>>> On Nov 1, 2011, at 7:30 PM, George Bosilca wrote:
>>>>>>
>>>>>>
>>>>>>> 1
>>>>>>>
>>>>>>> george.
>>>>>>>
>>>>>>> On Nov 1, 2011, at 17:23 , Jeff Squyres wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Can you clarify -- I can parse your text multiple ways. Which are
>>>>>>>> you voting for?
>>>>>>>>
>>>>>>>> 1. show_help + return error code in all cases.
>>>>>>>> 2. if OPAL_ENABLE_DEBUG, show_help + exit(1), else silently return
>>>>>>>> error code.
>>>>>>>> 3. show_help. if OPAL_ENABLE_DEBUG, exit(1), else return error
>>>>>>>> code.
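
The three options differ only in whether a help message is shown and
whether the process exits. A minimal C sketch of that decision table
follows. It assumes a hypothetical toy_decide() helper and a debug_build
flag standing in for OPAL_ENABLE_DEBUG; the types and constants are
illustrative and are not taken from the Open MPI sources.

```c
#include <assert.h>

/* Hypothetical error code returned when the process does not exit. */
#define TOY_ERROR (-1)

/* The three options from the list, in the same order. */
typedef enum {
    TOY_OPTION_1, /* show_help + return error code in all cases */
    TOY_OPTION_2, /* debug: show_help + exit(1); else silent return */
    TOY_OPTION_3  /* show_help always; debug: exit(1); else return */
} toy_option_t;

/* What a failed registration would do under a given option. */
typedef struct {
    int show_help; /* nonzero: print a help message */
    int do_exit;   /* nonzero: call exit(1) after any message */
    int rc;        /* code handed back when do_exit is zero */
} toy_action_t;

/* debug_build models OPAL_ENABLE_DEBUG being true (1) or false (0). */
toy_action_t toy_decide(toy_option_t option, int debug_build)
{
    toy_action_t action = { 0, 0, TOY_ERROR };

    assert(debug_build == 0 || debug_build == 1);
    switch (option) {
    case TOY_OPTION_1:
        action.show_help = 1;
        break;
    case TOY_OPTION_2:
        if (debug_build) {
            action.show_help = 1;
            action.do_exit = 1;
        }
        break;
    case TOY_OPTION_3:
        action.show_help = 1;
        action.do_exit = debug_build;
        break;
    }
    return action;
}
```

Only option 2 is ever silent, and only on non-debug builds; options 2 and
3 both exit on debug builds, while option 1 never exits.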
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 1, 2011, at 4:50 PM, George Bosilca wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> This is a much saner solution. We [mostly] stayed away from
>>>>>>>>> calling exit deep into our libraries, there is no reason to add it
>>>>>>>>> now. I'll vote in favor of show_help + return code.
>>>>>>>>>
>>>>>>>>> george.
>>>>>>>>>
>>>>>>>>> On Nov 1, 2011, at 15:14 , Jeff Squyres wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> We talked about this on the call today.
>>>>>>>>>>
>>>>>>>>>> A good suggestion was made: call show_help/opal_finalize/exit
>>>>>>>>>> only when OPAL_ENABLE_DEBUG is true. Otherwise, return an error
>>>>>>>>>> code.
>>>>>>>>>>
>>>>>>>>>> If no one objects to this, I'll commit this tomorrow.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Oct 31, 2011, at 4:16 PM, Jeff Squyres wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> WHAT: what to do if registering an MCA param results in an error?
>>>>>>>>>>>
>>>>>>>>>>> WHERE: opal/mca/base/mca_base_param.c
>>>>>>>>>>>
>>>>>>>>>>> WHY: MCA param re-registration issues should be treated as OMPI
>>>>>>>>>>> developer errors
>>>>>>>>>>>
>>>>>>>>>>> WHEN: COB Friday, 4 Nov 2011
>>>>>>>>>>>
>>>>>>>>>>> -----------------
>>>>>>>>>>>
>>>>>>>>>>> Short version:
>>>>>>>>>>>
>>>>>>>>>>> Re-registering an MCA param to be a different type (e.g., it was
>>>>>>>>>>> initially registered to be a string, but was later re-registered
>>>>>>>>>>> to be an int) should be treated as an OMPI developer error, and
>>>>>>>>>>> should opal_finalize()/exit(1).
>>>>>>>>>>>
>>>>>>>>>>> More details:
>>>>>>>>>>>
>>>>>>>>>>> A mistaken MCA param re-registration recently caused an orted
>>>>>>>>>>> segv.
>>>>>>>>>>>
>>>>>>>>>>> The MCA param subsystem was fixed to avoid this segv, but it now
>>>>>>>>>>> silently converts the MCA param to the newly-registered type.
>>>>>>>>>>> Upon reflection and some discussion, this seems to be a bad idea.
>>>>>>>>>>> Instead, we should loudly complain via a show_help message and
>>>>>>>>>>> then exit(1).
>>>>>>>>>>>
>>>>>>>>>>> Specifically: this kind of behavior is clearly an error and
>>>>>>>>>>> should be fixed. Unfortunately, in most cases, we don't actually
>>>>>>>>>>> check the return value from MCA param registration functions, so
>>>>>>>>>>> if we change the MCA param function to simply return a non
>>>>>>>>>>> OPAL_SUCCESS status, it's unlikely that anyone will notice until
>>>>>>>>>>> some code tries to read the param value, likely still resulting
>>>>>>>>>>> in a segv.
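
The type-mismatch case and the unchecked-return-value concern can be
sketched in C with a toy registry. This is only an illustration under
stated assumptions: toy_param_register(), toy_component_open(), the
TOY_* constants, and the fixed-size table are all hypothetical and do
not reflect the actual code in opal/mca/base/mca_base_param.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical status codes; not the real OPAL definitions. */
#define TOY_SUCCESS            0
#define TOY_ERR_TYPE_MISMATCH (-1)
#define TOY_ERR_FULL          (-2)
#define TOY_MAX_PARAMS         16

typedef enum { TOY_TYPE_INT, TOY_TYPE_STRING } toy_type_t;

typedef struct {
    const char *name; /* assumed to outlive the registry (e.g. a literal) */
    toy_type_t  type;
} toy_param_t;

static toy_param_t toy_params[TOY_MAX_PARAMS];
static int toy_num_params = 0;

/* Register a parameter. Re-registering with the same type is accepted;
 * re-registering with a different type is reported, not converted. */
int toy_param_register(const char *name, toy_type_t type)
{
    int i;

    assert(name != NULL);
    for (i = 0; i < toy_num_params; ++i) {
        if (strcmp(toy_params[i].name, name) == 0) {
            return (toy_params[i].type == type) ? TOY_SUCCESS
                                                : TOY_ERR_TYPE_MISMATCH;
        }
    }
    if (toy_num_params == TOY_MAX_PARAMS) {
        return TOY_ERR_FULL;
    }
    toy_params[toy_num_params].name = name;
    toy_params[toy_num_params].type = type;
    ++toy_num_params;
    return TOY_SUCCESS;
}

/* A caller that checks every registration and propagates the failure
 * instead of ignoring it and failing later when the value is read. */
int toy_component_open(void)
{
    int rc = toy_param_register("toy_verbose", TOY_TYPE_INT);
    if (rc != TOY_SUCCESS) {
        return rc;
    }
    /* Deliberate bug for illustration: same name, different type. */
    return toy_param_register("toy_verbose", TOY_TYPE_STRING);
}
```

If the caller dropped the return value, the mismatch would go unnoticed
until the parameter was read as the wrong type, which is the situation
the paragraph warns about.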
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have heartburn if I change the error behavior to
>>>>>>>>>>> opal_finalize()/exit(1)?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>
>>>>>>>>>>> jsquyres_at_[hidden]
>>>>>>>>>>>
>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>
>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>>
>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>> --
>>>>> Terry D. Dontje | Principal Software Engineer
>>>>> Developer Tools Engineering | +1.781.442.2631
>>>>> Oracle - Performance Technologies
>>>>> 95 Network Drive, Burlington, MA 01803
>>>>> Email terry.dontje_at_[hidden]
>>
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories

-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories