Open MPI User's Mailing List Archives

From: Galen M. Shipman (gshipman_at_[hidden])
Date: 2006-10-31 10:29:38


Gleb Natapov wrote:

>On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
>
>
>>On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov <glebn_at_[hidden]>
>>wrote:
>>
>>
>>
>>>If you use the OB1 PML (the default), it will never recover from a link
>>>down error no matter how many other transports you have. The reason is
>>>that OB1 never tracks what happens to buffers submitted to a BTL. So if
>>>a BTL can't, for any reason, transmit a packet passed to it by OB1, the
>>>job will get stuck because OB1 doesn't have enough information to try
>>>to resend the packet via another transport. For this kind of resource
>>>tracking there is the DR PML. In the case of the IB BTL, a link down
>>>event generates an error for each packet submitted for sending to the
>>>device. The IB BTL simply discards all these packets and relies on the
>>>PML to resend them, so even after a link up event a job will not
>>>recover if the OB1 PML is used with the IB BTL. This may be different
>>>with other transports.
>>>
>>>
>>This makes sense; one thing I'm wondering now is whether the OB1 PML is
>>able (and/or has enough information) to figure out that it can't
>>continue at all, and will abort the job.
>>
>>
>
>In the case of the openib BTL I don't see how a job may recover from a
>link down event, so I think aborting the job is the right thing to do.
>For other transports, if the transport can continue after a link up
>event as if nothing happened, it is worth waiting for link up. This
>capability may be added to the openib BTL too; it's just that nobody
>cares enough.
>
>
Ethernet doesn't fail in this case because the TCP stack handles it
gracefully. You can see the same behavior by unplugging an Ethernet
cable while an ssh session is open: plug it back in and you are
probably good to go after a bit of time (due to exponential backoff on
retransmission). For GM/MX over Myrinet the timeout on connection down
is quite high, and the software stack handles this gracefully. For IB
the link state transitions from LinkActive to LinkActDefer, and when
LinkDownTimeout expires the link transitions to the LinkDown state.
From the spec: LinkDownTimeout occurs when the port state machine has
continuously been in the LinkActDefer state for 10ms +3%/-51%. I
have no idea what that formula means; perhaps my PDF of the spec is
messed up.
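
One possible reading, which is an assumption and not something I can
confirm from the spec text, is that the value is a nominal 10 ms with an
asymmetric tolerance of +3% and -51%. The sketch below only does that
arithmetic; the function name and constants are made up for illustration.

```python
# Hypothetical reading: a nominal timeout with an asymmetric tolerance.
NOMINAL_MS = 10.0   # nominal LinkDownTimeout as quoted above
PLUS_PCT = 3.0      # assumed upper tolerance, in percent
MINUS_PCT = 51.0    # assumed lower tolerance, in percent


def tolerance_window(nominal, plus_pct, minus_pct):
    """Return (lowest, highest) value allowed by an asymmetric tolerance."""
    low = nominal * (1.0 - minus_pct / 100.0)
    high = nominal * (1.0 + plus_pct / 100.0)
    return low, high


low_ms, high_ms = tolerance_window(NOMINAL_MS, PLUS_PCT, MINUS_PCT)
print(f"LinkDownTimeout window: {low_ms:.1f} ms to {high_ms:.1f} ms")
```

Under that reading the timeout would fall somewhere between roughly
4.9 ms and 10.3 ms.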

So the transition to the LinkDown state is dictated by the IB spec. It
would seem that we would want to defer the transition based on a
user-configurable parameter. This is link layer, so it would probably be
necessary to set it when loading the IB driver. Am I interpreting this
correctly?
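
To make the idea concrete, here is a toy model of the three states named
above with the timeout as a parameter. It is only a sketch: the class,
the tick interface and the recovery rule are assumptions for
illustration, not the state machine from the spec or any driver's API.

```python
from enum import Enum


class LinkState(Enum):
    ACTIVE = "LinkActive"
    ACT_DEFER = "LinkActDefer"
    DOWN = "LinkDown"


class PortStateModel:
    """Toy model only; not the port state machine from the IB spec."""

    def __init__(self, link_down_timeout_ms=10.0):
        self.timeout_ms = link_down_timeout_ms  # hypothetical tunable
        self.state = LinkState.ACTIVE
        self.defer_ms = 0.0  # continuous time spent in LinkActDefer

    def tick(self, elapsed_ms, link_ok):
        """Advance the model by elapsed_ms and return the new state."""
        if self.state is LinkState.DOWN:
            return self.state  # toy model: no recovery once LinkDown
        if link_ok:
            self.state = LinkState.ACTIVE
            self.defer_ms = 0.0
        else:
            self.state = LinkState.ACT_DEFER
            self.defer_ms += elapsed_ms
            if self.defer_ms >= self.timeout_ms:
                self.state = LinkState.DOWN
        return self.state


def survives_outage(timeout_ms, outage_ms, step_ms=1.0):
    """True if the link is LinkActive again after an outage of outage_ms."""
    port = PortStateModel(timeout_ms)
    t = 0.0
    while t < outage_ms:
        port.tick(step_ms, link_ok=False)
        t += step_ms
    return port.tick(step_ms, link_ok=True) is LinkState.ACTIVE


print(survives_outage(timeout_ms=10.0, outage_ms=50.0))
print(survives_outage(timeout_ms=100.0, outage_ms=50.0))
```

In this model a 50 ms outage takes the port to LinkDown with a 10 ms
timeout, but is ridden out if the timeout is raised to 100 ms.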

- Galen

>--
> Gleb.