Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: MCA param registration errors
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-11-03 10:21:20


It was too complex, Tim. Nobody wound up using it, so some of us have found a simpler alternative that seems to work but hasn't been fully implemented across the code base yet (we're just doing it as we go).

And nothing solves the problem of errors from every proc when you direct launch.

On Nov 3, 2011, at 8:14 AM, Tim Mattox wrote:

> Brian,
> I thought the OPAL_SOS stuff was supposed to be the way to fix this?
> https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>
> Has that effort faded or not worked?
>
> On Wed, Nov 2, 2011 at 2:22 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:
>> To be honest, I don't care so much today, I'm just fighting so that the
>> output doesn't get worse. At some point, we do need to figure out a
>> better way of dealing with error messages, but not today :).
>>
>> Brian
>>
>> On 11/2/11 11:53 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>
>>> Hmmm....since it was my bug that surfaced the problem, maybe the best
>>> answer is to just return an error code. I'll slowly work thru the param
>>> registrations in ORTE and make them all check the return code. I'm
>>> willing to look at OPAL as I go, but someone else will have to deal with
>>> the OMPI layer.
>>>
>>> I don't know how to entirely avoid the message issue Brian mentions -
>>> I'll still have to say -something- when I get an error code, but I have
>>> come up with some methods for reducing the clutter.
>>>
>>> On Nov 2, 2011, at 11:43 AM, Barrett, Brian W wrote:
>>>
>>>> I really don't like our show_help at every level behavior (look at what
>>>> happens when MPI_INIT fails, you get a page per process of the same
>>>> error message from each level of the call stack). If you want to
>>>> show_help and abort on debug, that makes sense. It doesn't make any
>>>> sense on a production build. Return an error code and let the upper
>>>> layer deal with it.
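
A minimal sketch of the pattern described above, where lower layers only return an error code and a single top-level caller decides whether to print anything. All names here (low_level_init, mid_level_init, top_level_init, LAYER_SUCCESS, LAYER_ERROR, layer_messages) are hypothetical stand-ins for illustration, not Open MPI functions, and fprintf stands in for a show_help call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical return codes; not the real OPAL/ORTE/OMPI constants. */
#define LAYER_SUCCESS 0
#define LAYER_ERROR (-1)

/* Counts how many user-visible messages were printed. */
static int layer_messages = 0;

/* Lowest layer: detects the failure and reports it only via the return code. */
static int low_level_init(int should_fail)
{
    return should_fail ? LAYER_ERROR : LAYER_SUCCESS;
}

/* Middle layer: propagates the error code upward without printing. */
static int mid_level_init(int should_fail)
{
    int rc = low_level_init(should_fail);
    if (LAYER_SUCCESS != rc) {
        return rc;
    }
    return LAYER_SUCCESS;
}

/* Top layer: the only place that turns an error code into a message. */
static int top_level_init(int should_fail)
{
    int rc = mid_level_init(should_fail);
    assert(LAYER_SUCCESS == rc || LAYER_ERROR == rc);
    if (LAYER_SUCCESS != rc) {
        ++layer_messages;
        fprintf(stderr, "initialization failed (rc=%d)\n", rc);
    }
    return rc;
}
```

With this layout, a failure at the bottom of the call stack produces one message per process rather than one per layer. It does not address the separate problem of every process printing the same message.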
>>>>
>>>> Brian
>>>>
>>>> On 11/2/11 11:27 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>
>>>>> Brian: you were the one that had an allergic reaction to #1 on the call.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>> On Nov 2, 2011, at 1:23 PM, George Bosilca wrote:
>>>>>
>>>>>> As has been said, this is not something that is supposed to make it
>>>>>> into a release. In the unfortunate case where it does, always having a
>>>>>> show_help will ensure a quick complaint on one of our mailing lists
>>>>>> and increase the probability of a [very] quick fix.
>>>>>>
>>>>>> george.
>>>>>>
>>>>>> On Nov 2, 2011, at 06:26 , TERRY DONTJE wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/1/2011 7:48 PM, Jeff Squyres wrote:
>>>>>>>> So this was slightly different than the opinion that was discussed
>>>>>>>> on the call today, which was 2. The rationale for #2 was to punish
>>>>>>>> developers, but if such a bug did make it through to production,
>>>>>>>> users wouldn't be annoyed with show_help messages all the time.
>>>>>>>>
>>>>>>>> Does anyone have strong opinions here? I don't.
>>>>>>>>
>>>>>>>> I offer the following two points:
>>>>>>>>
>>>>>>>> - this is a coding error on the part of the OMPI developer
>>>>>>>> - it's pretty rare
>>>>>>>>
>>>>>>>>
>>>>>>> I think a show_help + return is very helpful in this case. I
>>>>>>> wouldn't think that we'd run into this case that much, and it would
>>>>>>> seem to be a rare occurrence that one could just fix when they run
>>>>>>> into it. However, since there was some opposition to having show_help
>>>>>>> messages possibly coming up all over the place, I thought a fallback
>>>>>>> of only doing the show_help on enable_debug builds was a reasonable
>>>>>>> middle ground.
>>>>>>>
>>>>>>> --td
>>>>>>>> On Nov 1, 2011, at 7:30 PM, George Bosilca wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> 1
>>>>>>>>>
>>>>>>>>> george.
>>>>>>>>>
>>>>>>>>> On Nov 1, 2011, at 17:23 , Jeff Squyres wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Can you clarify -- I can parse your text multiple ways. Which are
>>>>>>>>>> you voting for?
>>>>>>>>>>
>>>>>>>>>> 1. show_help + return error code in all cases.
>>>>>>>>>> 2. if OPAL_ENABLE_DEBUG, show_help + exit(1), else silently return
>>>>>>>>>> error code.
>>>>>>>>>> 3. show_help. if OPAL_ENABLE_DEBUG, exit(1), else return error
>>>>>>>>>> code.
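
The three options listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: DEMO_ENABLE_DEBUG stands in for OPAL_ENABLE_DEBUG, demo_show_help stands in for a show_help call, and the DEMO_* return codes and demo_policy_* functions are hypothetical names that do not exist in Open MPI.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real OPAL definitions. */
#define DEMO_SUCCESS 0
#define DEMO_ERROR (-1)

#ifndef DEMO_ENABLE_DEBUG
#define DEMO_ENABLE_DEBUG 0 /* stand-in for OPAL_ENABLE_DEBUG */
#endif

/* Counts how many help messages were shown. */
static int demo_help_count = 0;

/* Stand-in for a show_help call: prints one message and counts it. */
static void demo_show_help(const char *param)
{
    assert(param != NULL);
    ++demo_help_count;
    fprintf(stderr, "MCA param \"%s\" re-registered with a different type\n",
            param);
}

/* Option 1: show_help + return error code in all cases. */
static int demo_policy_1(const char *param)
{
    demo_show_help(param);
    return DEMO_ERROR;
}

/* Option 2: debug builds show_help + exit(1); otherwise silently return. */
static int demo_policy_2(const char *param)
{
    if (DEMO_ENABLE_DEBUG) {
        demo_show_help(param);
        exit(1);
    }
    return DEMO_ERROR;
}

/* Option 3: always show_help; debug builds exit(1), otherwise return. */
static int demo_policy_3(const char *param)
{
    demo_show_help(param);
    if (DEMO_ENABLE_DEBUG) {
        exit(1);
    }
    return DEMO_ERROR;
}
```

In a non-debug build of this sketch, options 1 and 3 behave identically (message plus error code), while option 2 returns the error code without printing. The options only diverge further when DEMO_ENABLE_DEBUG is nonzero.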
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 1, 2011, at 4:50 PM, George Bosilca wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> This is a much saner solution. We [mostly] stayed away from
>>>>>>>>>>> calling exit deep inside our libraries; there is no reason to
>>>>>>>>>>> add it now. I'll vote in favor of show_help + return code.
>>>>>>>>>>>
>>>>>>>>>>> george.
>>>>>>>>>>>
>>>>>>>>>>> On Nov 1, 2011, at 15:14 , Jeff Squyres wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> We talked about this on the call today.
>>>>>>>>>>>>
>>>>>>>>>>>> A good suggestion was made: call show_help/opal_finalize/exit
>>>>>>>>>>>> only when OPAL_ENABLE_DEBUG is true. Otherwise, return an error
>>>>>>>>>>>> code.
>>>>>>>>>>>>
>>>>>>>>>>>> If no one objects to this, I'll commit this tomorrow.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Oct 31, 2011, at 4:16 PM, Jeff Squyres wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> WHAT: what to do if registering an MCA param results in an
>>>>>>>>>>>>> error?
>>>>>>>>>>>>>
>>>>>>>>>>>>> WHERE: opal/mca/base/mca_base_param.c
>>>>>>>>>>>>>
>>>>>>>>>>>>> WHY: MCA param re-registration issues should be treated as OMPI
>>>>>>>>>>>>> developer errors
>>>>>>>>>>>>>
>>>>>>>>>>>>> WHEN: COB Friday, 4 Nov 2011
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Short version:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Re-registering an MCA param to be a different type (e.g., it
>>>>>>>>>>>>> was initially registered to be a string, but was later
>>>>>>>>>>>>> re-registered to be an int) should be treated as an OMPI
>>>>>>>>>>>>> developer error, and should opal_finalize()/exit(1).
>>>>>>>>>>>>>
>>>>>>>>>>>>> More details:
>>>>>>>>>>>>>
>>>>>>>>>>>>> A mistaken MCA param re-registration recently caused an orted
>>>>>>>>>>>>> segv.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The MCA param subsystem was fixed to avoid this segv, but it
>>>>>>>>>>>>> now silently converts the MCA param to the newly-registered
>>>>>>>>>>>>> type. Upon reflection and some discussion, this seems to be a
>>>>>>>>>>>>> bad idea. Instead, we should loudly complain via a show_help
>>>>>>>>>>>>> message and then exit(1).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Specifically: this kind of behavior is clearly an error and
>>>>>>>>>>>>> should be fixed. Unfortunately, in most cases, we don't
>>>>>>>>>>>>> actually check the return value from MCA param registration
>>>>>>>>>>>>> functions, so if we change the MCA param function to simply
>>>>>>>>>>>>> return a non-OPAL_SUCCESS status, it's unlikely that anyone
>>>>>>>>>>>>> will notice until some code tries to read the param value,
>>>>>>>>>>>>> likely still resulting in a segv.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone have heartburn if I change the error behavior to
>>>>>>>>>>>>> opal_finalize()/exit(1)?
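
To make the failure mode in the RFC concrete, here is a toy parameter registry that rejects a re-registration under a different type instead of silently converting it. This is a sketch, not the code in opal/mca/base/mca_base_param.c: the toy_* names, the TOY_* return codes, and the fixed-size table are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical return codes; not the real OPAL constants. */
#define TOY_SUCCESS 0
#define TOY_ERR_TYPE (-1) /* re-registered with a different type */
#define TOY_ERR_FULL (-2) /* no room left in the table */
#define TOY_MAX_PARAMS 16

typedef enum { TOY_TYPE_INT, TOY_TYPE_STRING } toy_type_t;

typedef struct {
    const char *name; /* not copied; the caller keeps the string alive */
    toy_type_t type;
} toy_param_t;

static toy_param_t toy_params[TOY_MAX_PARAMS];
static int toy_num_params = 0;

/* Register a param, or validate an existing registration of the same name. */
static int toy_param_register(const char *name, toy_type_t type)
{
    assert(name != NULL);
    for (int i = 0; i < toy_num_params; ++i) {
        if (0 == strcmp(toy_params[i].name, name)) {
            /* Same name seen before: the types must agree. */
            if (toy_params[i].type != type) {
                return TOY_ERR_TYPE;
            }
            return TOY_SUCCESS;
        }
    }
    if (toy_num_params >= TOY_MAX_PARAMS) {
        return TOY_ERR_FULL;
    }
    toy_params[toy_num_params].name = name;
    toy_params[toy_num_params].type = type;
    ++toy_num_params;
    return TOY_SUCCESS;
}
```

A caller that ignores the return value of toy_param_register would still carry on with a param of the wrong type, which is the concern raised above about unchecked return codes. Checking the result at every call site is what makes the error-code approach workable.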
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>
>>>>>>>>>>>>> jsquyres_at_[hidden]
>>>>>>>>>>>>>
>>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>
>>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>
>>>>>>> --
>>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>>> Developer Tools Engineering | +1.781.442.2631
>>>>>>> Oracle - Performance Technologies
>>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>>> Email terry.dontje_at_[hidden]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Brian W. Barrett
>>>> Dept. 1423: Scalable System Software
>>>> Sandia National Laboratories
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> timattox_at_[hidden] || tmattox_at_[hidden]
> I'm a bright... http://www.the-brights.net/
>