Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Change PML error handler signature
From: Rolf vandeVaart (rolf.vandevaart_at_[hidden])
Date: 2010-05-17 10:05:37


Hello Developers:

George and I talked some more about this change, and he has agreed that
it is OK. Therefore, I will be making this change sometime this week.

Rolf

On 04/23/10 11:47, George Bosilca wrote:
> The keyword here is consolidation. It's not about violating the initial design, it is more about keeping the design consistent. I think this change is good from an overall perspective, but now we have two ways to report similar types of problems. In other words, I don't see how we can fail sending or receiving a message from a peer without having to recreate the connection to it. At the same time, as we don't track the fragments in the PML, the fact that the BTL reports back an error on each fragment is good, as it gives us the opportunity to know exactly what we have to redo.
>
> george.
>
> On Apr 21, 2010, at 13:48 , Rolf vandeVaart wrote:
>
>
>> Hi George:
>> To report that an entire BTL is down, one just sets the ompi_proc_t argument to NULL. That is how I was using it. That means the mca_pml_ob1_error_handler could see that it is NULL, and map out the entire BTL. BTLs can set the ompi_proc_t if they want, and the PML is free to use or ignore it. This allows us to handle errors that occur on a receive where we would not want to error out the entire BTL, but just a single connection.
>>
>> Does that make this change better? Or am I still violating the general design?
>>
>> Rolf
>>
>> On 04/21/10 11:34, George Bosilca wrote:
>>
>>> The current error system follows a different design. There are basically two ways to report errors: per peer or global. The per-peer way can only be triggered by a specific send or receive, and is based on the value of the last argument of the callbacks. Such errors clearly indicate which peer and which message were involved when the error was detected. The second way is global, not peer related, and was supposed to be used more for local errors (such as this specific BTL is now down). As a result, this kind of error is supposed to unlink all peers connected through the BTL, and this is why the ompi_proc_t is not part of the argument list.
>>>
>>> If you change the signature of this function, this will change the design. And I'm not sure it makes it more consistent. How do we report that a BTL is now completely down and all peers connected through it have to be relinked through another BTL?
>>>
>>> george.
>>>
>>> On Apr 21, 2010, at 11:07 , Rolf vandeVaart wrote:
>>>
>>>
>>>
>>>
>>>> WHAT:
>>>> Add two arguments to the mca_pml_ob1_error_handler to make it more useful for BTLs that may take advantage of that feature. The new arguments are an ompi_proc_t pointer and a char pointer. This is what the new signature looks like.
>>>>
>>>> void mca_pml_ob1_error_handler(
>>>>     struct mca_btl_base_module_t* btl,
>>>>     int32_t flags, ompi_proc_t *errproc, char *btlname);
>>>>
>>>> WHY:
>>>> There are times when the BTL wants to notify the PML not only that it had an error, but also the endpoint the error occurred on. In addition, we add a string so the BTL can put descriptive information like which interface had the error.
>>>>
>>>> WHERE: ompi/mca/pml/pml_ob1.c
>>>> ompi/mca/btl/openib/btl_openib_component.c
>>>>
>>>> MORE DETAILS:
>>>> I just want to expand the function signature by two variables. Note that currently the only place the callback is used is in the openib BTL. And when the callback is called, it just aborts the program. So this has no effect whatsoever on the current library. I will also fix the signature in the other PMLs to keep things consistent.
>>>>
>>>> TIMEOUT: Monday, April 26, 2010 (as this is a minor change)
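
For illustration, here is a minimal C sketch of the convention discussed in this thread: a NULL ompi_proc_t means the whole BTL is down, while a non-NULL one limits the error to a single peer. It is not the Open MPI implementation. The types sketch_btl_module and sketch_proc, the enum sketch_action, and the function sketch_error_handler are hypothetical stand-ins invented for this example. Only the shape of the argument list follows the proposed signature, and the enum return value (the proposed handler returns void) is an assumption added so the decision can be inspected.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real Open MPI types; illustration only. */
struct sketch_btl_module { const char *name; };
struct sketch_proc { int rank; };

/* What the handler decided. A real PML would act on the decision
 * instead of returning a code (assumption made for this sketch). */
enum sketch_action { SKETCH_MAP_OUT_BTL, SKETCH_DROP_PEER };

/* Same argument shape as the signature proposed in the RFC. */
enum sketch_action sketch_error_handler(struct sketch_btl_module *btl,
                                        int32_t flags,
                                        struct sketch_proc *errproc,
                                        char *btlname)
{
    /* The BTL may pass descriptive text, e.g. an interface name. */
    const char *where = (btlname != NULL) ? btlname : "(unnamed)";

    (void)flags; /* not interpreted in this sketch */

    if (errproc == NULL) {
        /* No peer given: treat the entire BTL as down. */
        fprintf(stderr, "BTL %s (%s) is down; unlinking all peers\n",
                btl->name, where);
        return SKETCH_MAP_OUT_BTL;
    }

    /* A peer was given: only that connection is affected. */
    fprintf(stderr, "BTL %s (%s): error on peer %d only\n",
            btl->name, where, errproc->rank);
    return SKETCH_DROP_PEER;
}
```

A BTL would call sketch_error_handler(&btl, 0, NULL, name) to report a global failure and pass a peer pointer instead of NULL to report a per-connection failure. This keeps one entry point for both kinds of report, which is the consolidation concern raised in the thread.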