Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Adding error/verbose messages to the TCP BTL
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-03-05 14:59:13


On Mar 5, 2010, at 12:55 PM, Jeff Squyres wrote:

> On Mar 5, 2010, at 2:34 PM, George Bosilca wrote:
>
>> Being user friendly is good; being way too user friendly is less so (but I guess this is the price we have to pay for production-quality code, isn't it?).
>
> Agreed. None of these messages appear except in error cases or if you crank up the verbosity. The use case for this was a user (more than one, actually) who had problems with the TCP BTL deciding not to connect to peers for some reason. But there was no way to know exactly what the BTL was *trying* to do -- all you got was (effectively), "Sorry, I couldn't connect." So the main impetus for this was to give some visibility into what the TCP BTL is doing when it tries to connect -- you can see if it's trying to use private IP addresses by mistake, or somesuch.
>
>> I have a few comments:
>>
>> - In several places you replaced BTL_ERROR (which is how BTLs are supposed to complain) with a direct call to orte_show_help. This has several drawbacks: it drifts away from something more or less consistent across all BTLs, and it adds more dependencies between the BTLs and ORTE.
>
> I have never found BTL_ERROR to be terribly helpful. It is essentially just an fprintf -- it doesn't propagate errors upward or anything. I tend to prefer show_help because it lets you provide a meaningful error message -- and duplicate messages are not displayed (a feature that many people have told me they love). BTL_ERROR just guarantees that the user will have to email us to figure out what's going on, because the messages aren't meaningful to anyone other than an OMPI developer.

I'm not sure I understand this concern either, especially the latter point about the ORTE dependency. There are already 5 calls to orte_show_help in this BTL, along with several references to orte_process_info and other ORTE elements. What harm is done by adding more calls to orte_show_help?

I better understand the BTL_ERROR issue, but it raises the question as to whether BTL_ERROR should be defined as an orte_show_help call. That might help reduce the flood of duplicate messages when an error occurs.
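
As a rough sketch of what that could look like (this is not the real ORTE code: demo_show_help, DEMO_BTL_ERROR, and the help file name are made up for illustration, and keying the duplicate check on file plus topic is an assumption):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_SEEN 64   /* hypothetical cap on remembered messages */
#define MAX_KEY  128  /* hypothetical cap on "file:topic" key length */

static char seen[MAX_SEEN][MAX_KEY];
static int seen_count = 0;

/* Hypothetical stand-in for a show_help-style call.
 * Prints the message the first time a (file, topic) pair is seen and
 * suppresses later duplicates.
 * Returns true if printed, false if suppressed. */
static bool demo_show_help(const char *file, const char *topic,
                           const char *detail)
{
    char key[MAX_KEY];
    snprintf(key, sizeof(key), "%s:%s", file, topic);

    for (int i = 0; i < seen_count; i++) {
        if (strcmp(seen[i], key) == 0) {
            return false;  /* duplicate: stay quiet */
        }
    }
    if (seen_count < MAX_SEEN) {
        strcpy(seen[seen_count++], key);  /* key always fits in MAX_KEY */
    }
    fprintf(stderr, "[%s] %s: %s\n", file, topic, detail);
    return true;
}

/* The error macro becomes a thin wrapper, so existing call sites keep
 * one consistent spelling while gaining duplicate suppression. */
#define DEMO_BTL_ERROR(topic, detail) \
    demo_show_help("help-demo-btl.txt", (topic), (detail))
```

With something like this, a second DEMO_BTL_ERROR("connect-failed", ...) would be dropped, while a different topic would still get through.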

>
>> - There are a lot of places where you just indented the code or split a medium-sized line into several lines. I find the code more difficult to read.
>
> Ja; I did re-indent some code because I found it hard to read the super-long lines while trying to figure out the TCP BTL code. Sorry about that.
>
> You do the same thing sometimes, too. ;-)
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/