Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: MCA param registration errors
From: Tim Mattox (timattox_at_[hidden])
Date: 2011-11-03 10:14:48


Brian,
I thought the OPAL_SOS stuff was supposed to be the way to fix this?
https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

Has that effort faded, or did it not work out?

On Wed, Nov 2, 2011 at 2:22 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:
> To be honest, I don't care so much today, I'm just fighting so that the
> output doesn't get worse.  At some point, we do need to figure out a
> better way of dealing with error messages, but not today :).
>
> Brian
>
> On 11/2/11 11:53 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>
>>Hmmm....since it was my bug that surfaced the problem, maybe the best
>>answer is to just return an error code. I'll slowly work thru the param
>>registrations in ORTE and make them all check the return code. I'm
>>willing to look at OPAL as I go, but someone else will have to deal with
>>the OMPI layer.
>>
>>I don't know how to entirely avoid the message issue Brian mentions -
>>I'll still have to say -something- when I get an error code, but I have
>>come up with some methods for reducing the clutter.
>>
>>On Nov 2, 2011, at 11:43 AM, Barrett, Brian W wrote:
>>
>>> I really don't like our show_help at every level behavior (look at what
>>> happens when MPI_INIT fails, you get a page per process of the same error
>>> message from each level of the call stack).  If you want to show_help and
>>> abort on debug, that makes sense.  It doesn't make any sense on a
>>> production build.  Return an error code and let the upper layer deal with
>>> it.
>>>
>>> Brian
>>>
>>> On 11/2/11 11:27 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>
>>>> Brian: you were the one that had an allergic reaction to #1 on the call.
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>> On Nov 2, 2011, at 1:23 PM, George Bosilca wrote:
>>>>
>>>>> As has been said, this is not something that is supposed to make it into
>>>>> a release. In the unfortunate case where it does, always having a
>>>>> show_help will ensure a quick complaint on one of our mailing lists and
>>>>> increase the probability of a [very] quick fix.
>>>>>
>>>>>  george.
>>>>>
>>>>> On Nov 2, 2011, at 06:26 , TERRY DONTJE wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/1/2011 7:48 PM, Jeff Squyres wrote:
>>>>>>> So this was slightly different than the opinion that was discussed on
>>>>>>> the call today, which was 2.  The rationale for #2 was to punish
>>>>>>> developers, but if such a bug did make it through to production, users
>>>>>>> wouldn't be annoyed with show_help messages all the time.
>>>>>>>
>>>>>>> Does anyone have strong opinions here?  I don't.
>>>>>>>
>>>>>>> I offer the following two points:
>>>>>>>
>>>>>>> - this is a coding error on the part of the OMPI developer
>>>>>>> - it's pretty rare
>>>>>>>
>>>>>>>
>>>>>> I think a show_help + return is very helpful in this case.  I wouldn't
>>>>>> think that we'd run into this case that much, and it would seem to be
>>>>>> a rare occurrence that one could just fix when they run into it.
>>>>>> However, since there was some opposition to having show_help messages
>>>>>> possibly coming up all over the place, I thought a fallback of only
>>>>>> doing the show_help on enable_debug builds was a reasonable middle
>>>>>> ground.
>>>>>>
>>>>>> --td
>>>>>>> On Nov 1, 2011, at 7:30 PM, George Bosilca wrote:
>>>>>>>
>>>>>>>
>>>>>>>> 1
>>>>>>>>
>>>>>>>> george.
>>>>>>>>
>>>>>>>> On Nov 1, 2011, at 17:23 , Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Can you clarify -- I can parse your text multiple ways.  Which are
>>>>>>>>> you voting for?
>>>>>>>>>
>>>>>>>>> 1. show_help + return error code in all cases.
>>>>>>>>> 2. if OPAL_ENABLE_DEBUG, show_help + exit(1), else silently return
>>>>>>>>> error code.
>>>>>>>>> 3. show_help.  if OPAL_ENABLE_DEBUG, exit(1), else return error
>>>>>>>>> code.
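
For reference, the three options listed above can be sketched as a small decision function in C. This is only an illustration of the choices under discussion, not code from opal/mca/base/mca_base_param.c: the names reg_error_policy_t, reg_error_action_t, and reg_error_action are hypothetical, and the debug_build argument stands in for the compile-time OPAL_ENABLE_DEBUG setting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three options in the list above. */
typedef enum {
    POLICY_ALWAYS_SHOW_HELP = 1,  /* 1: show_help + return error code     */
    POLICY_DEBUG_ONLY       = 2,  /* 2: debug: show_help + exit(1);       */
                                  /*    otherwise silently return error   */
    POLICY_SHOW_THEN_SPLIT  = 3   /* 3: show_help; debug: exit(1),        */
                                  /*    otherwise return error code       */
} reg_error_policy_t;

/* What the registration code would do when it detects the error. */
typedef struct {
    bool show_help;  /* print a show_help message                      */
    bool exit_now;   /* call exit(1) instead of returning an error code */
} reg_error_action_t;

/* debug_build is an assumption standing in for OPAL_ENABLE_DEBUG. */
reg_error_action_t reg_error_action(reg_error_policy_t policy, bool debug_build)
{
    reg_error_action_t action = { false, false };

    assert(policy >= POLICY_ALWAYS_SHOW_HELP && policy <= POLICY_SHOW_THEN_SPLIT);

    switch (policy) {
    case POLICY_ALWAYS_SHOW_HELP:
        action.show_help = true;
        break;
    case POLICY_DEBUG_ONLY:
        action.show_help = debug_build;
        action.exit_now  = debug_build;
        break;
    case POLICY_SHOW_THEN_SPLIT:
        action.show_help = true;
        action.exit_now  = debug_build;
        break;
    }
    return action;
}
```

Written this way, the options differ only in a production build: option 1 and option 3 both print and return, while option 2 returns silently. In a debug build, options 2 and 3 behave identically (print, then exit), and only option 1 never exits.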
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 1, 2011, at 4:50 PM, George Bosilca wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> This is a much saner solution. We [mostly] stayed away from
>>>>>>>>>> calling exit deep in our libraries; there is no reason to add it
>>>>>>>>>> now. I'll vote in favor of show_help + return code.
>>>>>>>>>>
>>>>>>>>>> george.
>>>>>>>>>>
>>>>>>>>>> On Nov 1, 2011, at 15:14 , Jeff Squyres wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> We talked about this on the call today.
>>>>>>>>>>>
>>>>>>>>>>> A good suggestion was made: call show_help/opal_finalize/exit
>>>>>>>>>>> only when OPAL_ENABLE_DEBUG is true.  Otherwise, return an error
>>>>>>>>>>> code.
>>>>>>>>>>>
>>>>>>>>>>> If no one objects to this, I'll commit this tomorrow.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Oct 31, 2011, at 4:16 PM, Jeff Squyres wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> WHAT: what to do if registering an MCA param results in an error?
>>>>>>>>>>>>
>>>>>>>>>>>> WHERE: opal/mca/base/mca_base_param.c
>>>>>>>>>>>>
>>>>>>>>>>>> WHY: MCA param re-registration issues should be treated as OMPI
>>>>>>>>>>>> developer errors
>>>>>>>>>>>>
>>>>>>>>>>>> WHEN: COB Friday, 4 Nov 2011
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------
>>>>>>>>>>>>
>>>>>>>>>>>> Short version:
>>>>>>>>>>>>
>>>>>>>>>>>> Re-registering an MCA param to be a different type (e.g., it was
>>>>>>>>>>>> initially registered to be a string, but was later re-registered
>>>>>>>>>>>> to be an int) should be treated as an OMPI developer error, and
>>>>>>>>>>>> should opal_finalize()/exit(1).
>>>>>>>>>>>>
>>>>>>>>>>>> More details:
>>>>>>>>>>>>
>>>>>>>>>>>> A mistaken MCA param re-registration recently caused an orted
>>>>>>>>>>>> segv.
>>>>>>>>>>>>
>>>>>>>>>>>> The MCA param subsystem was fixed to avoid this segv, but it now
>>>>>>>>>>>> silently converts the MCA param to the newly-registered type.
>>>>>>>>>>>> Upon reflection and some discussion, this seems to be a bad idea.
>>>>>>>>>>>> Instead, we should loudly complain via a show_help message and
>>>>>>>>>>>> then exit(1).
>>>>>>>>>>>>
>>>>>>>>>>>> Specifically: this kind of behavior is clearly an error and
>>>>>>>>>>>> should be fixed.  Unfortunately, in most cases, we don't actually
>>>>>>>>>>>> check the return value from MCA param registration functions, so
>>>>>>>>>>>> if we change the MCA param function to simply return a
>>>>>>>>>>>> non-OPAL_SUCCESS status, it's unlikely that anyone will notice
>>>>>>>>>>>> until some code tries to read the param value, likely still
>>>>>>>>>>>> resulting in a segv.
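
To make the failure mode concrete, here is a minimal sketch of a registration function that rejects a re-registration with a different type by returning an error code. It is a toy table, not the real MCA param API: param_register, param_type_t, the PARAM_* codes, and MAX_PARAMS are all hypothetical, with PARAM_SUCCESS standing in for OPAL_SUCCESS.

```c
#include <assert.h>
#include <string.h>

#define PARAM_SUCCESS    0   /* stand-in for OPAL_SUCCESS                 */
#define PARAM_ERR_TYPE  -1   /* hypothetical: re-registered as other type */
#define PARAM_ERR_FULL  -2   /* hypothetical: no room left in the table   */
#define MAX_PARAMS      16   /* arbitrary size for this sketch            */

typedef enum { PARAM_TYPE_INT, PARAM_TYPE_STRING } param_type_t;

typedef struct {
    const char  *name;
    param_type_t type;
} param_entry_t;

static param_entry_t params[MAX_PARAMS];
static int num_params = 0;

/* Register a param, or re-register an existing one.  Re-registering with
 * the same type is accepted; a different type is reported as an error
 * rather than silently converting the param. */
int param_register(const char *name, param_type_t type)
{
    assert(name != NULL);

    for (int i = 0; i < num_params; ++i) {
        if (strcmp(params[i].name, name) == 0) {
            return (params[i].type == type) ? PARAM_SUCCESS : PARAM_ERR_TYPE;
        }
    }
    if (num_params == MAX_PARAMS) {
        return PARAM_ERR_FULL;
    }
    /* Only the pointer is stored, so the caller must keep the name alive. */
    params[num_params].name = name;
    params[num_params].type = type;
    ++num_params;
    return PARAM_SUCCESS;
}
```

The error code only helps if callers look at it. A caller that ignores the return value of param_register("example_verbose", PARAM_TYPE_INT) after the param was first registered as a string will go on to read the value as an int, which is the situation described above.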
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have heartburn if I change the error behavior to
>>>>>>>>>>>> opal_finalize()/exit(1)?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>
>>>>>>>>>>>> jsquyres_at_[hidden]
>>>>>>>>>>>>
>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>
>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>> --
>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>> Developer Tools Engineering | +1.781.442.2631
>>>>>> Oracle - Performance Technologies
>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>> Email terry.dontje_at_[hidden]
>>>>>>
>>>
>>>
>>> --
>>>  Brian W. Barrett
>>>  Dept. 1423: Scalable System Software
>>>  Sandia National Laboratories

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 timattox_at_[hidden] || tmattox_at_[hidden]
    I'm a bright... http://www.the-brights.net/