Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] BTL add procs errors
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-05-27 08:48:46


On May 27, 2010, at 1:47 AM, Sylvain Jeaugey wrote:

> I don't think what the openib BTL is doing is that bad. It is returning an error because something really went wrong in IB. So yes, it could blank the bitmask and return success, but would you really want IB to fail and fall back on TCP once in a while without any notice?

As a sys admin - no, I would want to know it happened.

As a user - heck yeah! I don't care how the problem gets solved, I just want the answer. It will probably take longer to complete, but that is better than having to start all over just because the cluster hiccups.

I believe this is what the notifier is intended to resolve.

> I wouldn't.
>
> So, as it seems that all "normal" problems can be handled through the reachable bitmask, it seems a good idea to me that BTLs returning errors make the application stop.
>
> Sylvain
>
> On Wed, 26 May 2010, Barrett, Brian W wrote:
>
>> George -
>>
>> I'm not sure I agree - the return code should indicate a failure beyond "something prohibited me from talking to the remote side" - something occurred that resulted in it being highly unlikely the app can successfully run to completion (such as malloc failing). On the other hand, I also think that the OpenIB BTL is probably doing the wrong thing - I can't imagine that the error returned reaches that state of badness, and it should probably zero out the bitmask and quietly return rather than try to cause the app to abort.
>>
>> Just my $0.02.
>>
>> Brian
>>
>>
>> On May 25, 2010, at 12:27 PM, George Bosilca wrote:
>>
>>> The BTLs are allowed to fail adding procs without major consequences in the short term. As you noticed, each BTL returns a bit mask array containing all procs reachable through this particular instance of the BTL. Later (in the same file, line 395) we check for complete coverage of all procs, and only complain if one of the peers is unreachable.
>>>
>>> If you replace the continue statement by a return, we will never give a chance to the other BTLs and we will complain about lack of connectivity as soon as one BTL fails (for whatever reason). Not to mention that all the eager, send and rdma endpoint arrays will not be built.
>>>
>>> george.
>>>
>>> On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase.
>>>>
>>>> The current bml/r2 code silently ignores btl->add_procs() error codes with the following comment:
>>>> ---- ompi/mca/bml/r2/bml_r2.c:208 ----
>>>> /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>>>> * can take care of this task. */
>>>> continue;
>>>> --------------------------------------
>>>>
>>>> This seems wrong to me: either a proc is reached (the "reachable" bit field is therefore updated), or it is not (and nothing is done). Any error code should denote a fatal error needing a clean abort.
>>>>
>>>> In the current openib btl code, the "reachable" bit is set but an error is returned - then ignored by r2. The next call to the openib BTL results in a segmentation fault.
>>>>
>>>> So, maybe this simple fix would do the trick :
>>>> ========================================================================
>>>> diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
>>>> --- a/ompi/mca/bml/r2/bml_r2.c Wed May 19 14:35:27 2010 +0200
>>>> +++ b/ompi/mca/bml/r2/bml_r2.c Tue May 25 10:54:19 2010 +0200
>>>> @@ -210,7 +210,7 @@
>>>> /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>>>> * can take care of this task.
>>>> */
>>>> - continue;
>>>> + return rc;
>>>> }
>>>>
>>>> /* for each proc that is reachable */
>>>> ========================================================================
>>>>
>>>> Does anyone see a case (with a specific BTL) where add_procs returns an error but we still want to continue?
>>>>
>>>> Sylvain
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
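
For readers following the thread, the behaviour George describes (each BTL reports a reachability bitmask, a failing BTL is skipped, and a complaint is raised only if some peer is covered by no BTL) can be sketched roughly as below. This is a simplified illustration, not the actual bml/r2 code: every name in it (toy_btl_t, toy_add_procs_all, the toy_*_add_procs callbacks) is hypothetical, a single 64-bit word stands in for the real bitmap type, and the return values 0 and -1 stand in for the real error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROCS 64   /* assumption: one bit per proc in a uint64_t */

/* Hypothetical stand-in for a BTL's add_procs entry point: it sets one bit
 * per reachable proc in *reachable and returns 0 on success, nonzero on error. */
typedef int (*toy_add_procs_fn)(size_t nprocs, uint64_t *reachable);

typedef struct {
    const char      *name;
    toy_add_procs_fn add_procs;
} toy_btl_t;

/* Mask with the low nprocs bits set. */
static uint64_t toy_full_mask(size_t nprocs)
{
    return (nprocs >= TOY_MAX_PROCS) ? UINT64_MAX
                                     : ((UINT64_C(1) << nprocs) - 1);
}

/* Ask every BTL for its mask. A BTL that returns an error is skipped and its
 * mask is discarded, so the remaining BTLs still get their chance. Returns 0
 * if every proc is reachable through at least one BTL, -1 otherwise. */
int toy_add_procs_all(const toy_btl_t *btls, size_t nbtls, size_t nprocs,
                      uint64_t *covered_out)
{
    assert(nprocs > 0 && nprocs <= TOY_MAX_PROCS);
    const uint64_t all = toy_full_mask(nprocs);
    uint64_t covered = 0;

    for (size_t i = 0; i < nbtls; i++) {
        uint64_t reachable = 0;
        if (btls[i].add_procs(nprocs, &reachable) != 0) {
            continue;   /* the "continue" under discussion in the thread */
        }
        covered |= reachable & all;
    }

    if (covered_out != NULL) {
        *covered_out = covered;
    }
    return (covered == all) ? 0 : -1;   /* complain only on incomplete coverage */
}

/* Hypothetical BTLs for illustration only. */
int toy_ib_add_procs(size_t nprocs, uint64_t *reachable)
{
    (void)nprocs;
    *reachable = 1;     /* sets a bit but still reports an error */
    return -1;
}

int toy_tcp_add_procs(size_t nprocs, uint64_t *reachable)
{
    *reachable = toy_full_mask(nprocs);   /* reaches every proc */
    return 0;
}

int toy_self_add_procs(size_t nprocs, uint64_t *reachable)
{
    (void)nprocs;
    *reachable = 1;     /* reaches proc 0 only */
    return 0;
}
```

With the failing toy_ib_add_procs listed before toy_tcp_add_procs, the call still succeeds because the second BTL covers every proc. Replacing the continue with a return would make the same call fail at the first BTL, which is the trade-off being debated in this thread.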