On Apr 13, 2009, at 10:05, Rolf Vandevaart wrote:
> We also looking at mapping out a BTL when we get an error. We are
> going down the path of looking at registering a PML OB1 callback
> function that gets invoked when we get an error in the BTL. Then
> this PML OB1 callback function can map out the BTL via a call to
> mca_bml.bml_del_btl(btl) which seems to be doing the right thing.
There is already a PML function (mca_pml_ob1_error_handler) that gets
called when the BTL detects an error [not related to any message].
However, the only thing this function does is call abort.
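As a rough illustration of the map-out idea (drop the failing BTL and
abort only when no transport is left), here is a self-contained toy
model in C. Every type and function prefixed with toy_ is hypothetical
and is not part of Open MPI; real code would go through the BML, for
example the mca_bml.bml_del_btl(btl) call mentioned in the quoted text.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BTLS 4

/* Hypothetical stand-in for a BTL module (not mca_btl_base_module_t). */
typedef struct {
    const char *name;
    bool usable;
} toy_btl_t;

/* Hypothetical table of the transports the toy PML may still use. */
typedef struct {
    toy_btl_t btls[TOY_MAX_BTLS];
    size_t count;
} toy_bml_t;

/* Toy analogue of a "delete BTL" call: mark the failing BTL unusable
 * and return how many BTLs remain usable afterwards. */
static size_t toy_bml_del_btl(toy_bml_t *bml, const toy_btl_t *failed)
{
    size_t remaining = 0;
    for (size_t i = 0; i < bml->count; i++) {
        if (&bml->btls[i] == failed) {
            bml->btls[i].usable = false;
        }
        if (bml->btls[i].usable) {
            remaining++;
        }
    }
    return remaining;
}

/* Toy error callback: map out the failing BTL. Returns true if at
 * least one BTL is left, false if the caller has to abort. */
static bool toy_pml_error_handler(toy_bml_t *bml, const toy_btl_t *failed)
{
    return toy_bml_del_btl(bml, failed) > 0;
}
```

With two BTLs registered, the first failure leaves the job running on
the remaining transport; only the second failure would force an abort.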
> But, to make this all work requires changes to the PML OB1 layer.
I have another version of the PML that is way more resilient than the
one in the trunk. It is part of the fault tolerance work we're doing
here at UTK, but it wasn't expected to go in the trunk anytime soon ...
> We are also figuring out what we do for retransmission when we get
> an error.
There is some code for this. If the descriptor is for an RMA
operation, we simply transform it into a send over another BTL. Right
now, OB1 does not handle transmission failures for the match and
rendezvous fragments.
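The failover rule described here can be sketched as a small decision
function. This is a toy model only: the toy_ types, the fragment kinds
and the "usable" array are assumptions made for illustration and do not
correspond to the actual OB1 data structures.

```c
#include <stdbool.h>

/* Hypothetical fragment kinds, loosely named after the text above. */
typedef enum { TOY_FRAG_MATCH, TOY_FRAG_RNDV, TOY_FRAG_RMA, TOY_FRAG_SEND } toy_frag_kind_t;

/* Hypothetical descriptor: what is being sent and over which BTL. */
typedef struct {
    toy_frag_kind_t kind;
    int btl_index;
} toy_desc_t;

/* After a failure on desc->btl_index, try to reschedule the descriptor.
 * Only RMA descriptors are handled: they are turned into a plain send
 * on the first other usable BTL. Match and rendezvous fragments are
 * left untouched. Returns true if the descriptor was rescheduled. */
static bool toy_failover(toy_desc_t *desc, const bool *btl_usable, int nbtls)
{
    if (desc->kind != TOY_FRAG_RMA) {
        return false; /* not handled in this sketch */
    }
    for (int i = 0; i < nbtls; i++) {
        if (i != desc->btl_index && btl_usable[i]) {
            desc->kind = TOY_FRAG_SEND;
            desc->btl_index = i;
            return true;
        }
    }
    return false; /* no alternative BTL available */
}
```

A caller would treat a false return as an unrecoverable error for that
descriptor, which matches the current lack of handling for match and
rendezvous fragments.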