
Open MPI User's Mailing List Archives


From: Galen M. Shipman (gshipman_at_[hidden])
Date: 2006-10-31 10:43:10


Galen M. Shipman wrote:

>Gleb Natapov wrote:
>
>>On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
>>
>>>On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov <glebn_at_[hidden]>
>>>wrote:
>>>
>>>>If you use the OB1 PML (the default one) it will never recover from a
>>>>link-down error, no matter how many other transports you have. The reason
>>>>is that OB1 never tracks what happens to buffers submitted to a BTL. So if
>>>>a BTL cannot, for any reason, transmit a packet passed to it by OB1, the
>>>>job will get stuck, because OB1 doesn't have enough information to try to
>>>>resend the packet via another transport. For this kind of resource
>>>>tracking there is the DR PML. In the case of the IB BTL, a link-down event
>>>>generates an error for each packet submitted to the device for sending.
>>>>The IB BTL simply discards all these packets and relies on the PML to
>>>>resend them, so even after a link-up event a job will not recover if the
>>>>OB1 PML is used with the IB BTL. This may be different with other
>>>>transports.
>>>>
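[Editor's note: the tracking-and-resend behavior described above can be sketched in miniature. This is a toy Python model for illustration only, not Open MPI's actual data structures or API; the class and transport names are made up.]

```python
# Toy model of the difference described above -- NOT Open MPI's real code.
# A DR-like layer keeps enough information about each submitted buffer to
# resend it through another transport when one transport fails.

class Transport:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def send(self, buf):
        if not self.up:
            raise IOError(f"{self.name}: link down")

class TrackingPML:
    """DR-like: remembers the buffer, so it can retry other transports."""
    def __init__(self, transports):
        self.transports = transports

    def send(self, buf):
        for t in self.transports:      # try each transport in turn
            try:
                t.send(buf)
                return t.name          # delivered; buffer may be released
            except IOError:
                continue               # failed: resend via the next one
        raise RuntimeError("all transports down")

ib = Transport("openib", up=False)     # IB port whose link just went down
tcp = Transport("tcp")                 # a second, healthy transport
print(TrackingPML([ib, tcp]).send(b"payload"))  # falls back to tcp
```

An OB1-like layer, by contrast, hands the buffer to the failing transport and keeps no record of it, so the error above would leave the message lost and the job stuck.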
>>>This makes sense; one thing I'm wondering now is whether the OB1 PML is
>>>able (and/or has enough information) to figure out that it can't continue
>>>at all, and will abort the job.
>>>
>>In the case of the openib BTL I don't see how a job could recover from a
>>link-down event, so I think aborting the job is the right thing to do. For
>>other transports, if the transport can continue after a link-up event as if
>>nothing had happened, it is worth waiting for link up. This capability
>>could be added to the openib BTL too; it's just that nobody cares enough.
>>
>Ethernet doesn't fail in this case because the TCP stack handles it
>gracefully. The same behavior can be observed when disconnecting an
>ethernet cable while an ssh session exists: plug it back in and you are
>probably good to go after a bit of time (due to exponential backoff on
>retransmission). For GM/MX over Myrinet the timeout on connection down is
>quite high, and the software stack handles this gracefully. For IB, the
>link state transitions from LinkActive to LinkActDefer until
>LinkDownTimeout expires, and the link then transitions to the LinkDown state.
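[Editor's note: the exponential backoff mentioned above for TCP retransmission can be illustrated with a quick sketch. The delay values here are illustrative, not the actual RTO schedule of any particular TCP stack.]

```python
def backoff_delays(initial=0.2, factor=2.0, retries=6):
    """Delays between successive retransmissions under exponential backoff.

    Each failed retransmit doubles the wait, so a cable that is replugged
    anywhere inside this window goes unnoticed by the application.
    """
    return [initial * factor ** i for i in range(retries)]

delays = backoff_delays()
print(delays)                 # [0.2, 0.4, 0.8, 1.6, 3.2, 6.4]
print(round(sum(delays), 1))  # 12.6 seconds of grace to replug the cable
```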
> From the spec: LinkDownTimeout occurs when the port state machine has
>continuously been in the LinkActDefer state for 10 ms +3% / -51% .. I
>have no idea what that formula means; perhaps my PDF of the spec is
>messed up.
>
Okay, so those are percentages, not modulus operators; the formula makes
some sense now: the timeout is between 4.9 and 10.3 ms, so you had better
plug the cable out and back in very quickly ;-)
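[Editor's note: treating +3% / -51% as asymmetric tolerances on the 10 ms nominal value, the bounds quoted above work out as follows (plain arithmetic on the numbers from the spec excerpt).]

```python
nominal_ms = 10.0                        # LinkDownTimeout nominal value
lower = nominal_ms * (1 - 0.51)          # -51% tolerance
upper = nominal_ms * (1 + 0.03)          # +3% tolerance
print(round(lower, 1), round(upper, 1))  # 4.9 10.3
```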

>So transitioning to the LinkDown state is dictated by the IB spec. It would
>seem that we would want to defer the transition based on a user-configurable
>parameter; since this is at the link layer, it would probably be necessary
>to do this when loading the IB driver. Am I interpreting this correctly?
>
>- Galen
>
>>--
>> Gleb.
>>_______________________________________________
>>users mailing list
>>users_at_[hidden]
>>http://www.open-mpi.org/mailman/listinfo.cgi/users