Open MPI User's Mailing List Archives

From: Troy Telford (ttelford.groups_at_[hidden])
Date: 2006-10-30 13:45:53

On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov <glebn_at_[hidden]> wrote:

> If you use the OB1 PML (the default), it will never recover from a link-down
> error, no matter how many other transports you have. The reason is that
> OB1 never tracks what happens to buffers submitted to a BTL. So if the BTL
> cannot, for any reason, transmit a packet passed to it by OB1, the job will
> get stuck, because OB1 doesn't have enough information to try to resend the
> packet via another transport. The DR PML exists for this kind of resource
> tracking. In the case of the IB BTL, a link-down event generates an error for
> each packet submitted to the device for sending. The IB BTL simply discards
> all of these packets and relies on the PML to resend them, so even after a
> link-up event a job will not recover if the OB1 PML is used with the IB BTL.
> This may be different with other transports.
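
To make the quoted explanation concrete, here is a toy Python model of the two
behaviours. It is only a sketch under stated assumptions, not Open MPI code:
the Transport class and the send_untracked / send_tracked functions are
hypothetical names invented for illustration, and the real PML/BTL interfaces
look nothing like this. The point is just that a sender which forgets a packet
after handing it off cannot resend it, while one that tracks failed sends can
retry on another transport.

```python
from dataclasses import dataclass, field


@dataclass
class Transport:
    """Hypothetical stand-in for a BTL: discards packets while its link is down."""
    name: str
    link_up: bool = True
    delivered: list = field(default_factory=list)

    def send(self, packet):
        # Return True on delivery; a down link silently discards the packet.
        if not self.link_up:
            return False
        self.delivered.append(packet)
        return True


def send_untracked(packets, transports):
    """Toy 'no tracking' sender: hand each packet off once and forget it."""
    for i, packet in enumerate(packets):
        transports[i % len(transports)].send(packet)  # outcome is ignored


def send_tracked(packets, transports):
    """Toy 'tracking' sender: remember failed sends and retry on other transports.

    Returns the packets that could not be delivered on any transport.
    """
    pending = []
    for i, packet in enumerate(packets):
        first = i % len(transports)
        if not transports[first].send(packet):
            pending.append((packet, first))

    lost = []
    for packet, failed in pending:
        others = [t for j, t in enumerate(transports) if j != failed]
        # any() stops at the first transport that accepts the packet.
        if not any(t.send(packet) for t in others):
            lost.append(packet)
    return lost


def received(transports):
    """All packets that arrived, regardless of which transport carried them."""
    return sorted(p for t in transports for p in t.delivered)
```

With one transport down and one up, send_untracked loses every packet that was
striped onto the dead link, while send_tracked delivers all of them over the
surviving one. In this toy, the list returned by send_tracked is the kind of
bookkeeping a sender would need in order to notice that nothing can carry a
packet any more.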

This makes sense. One thing I'm wondering now is whether the OB1 PML is able
(and/or has enough information) to figure out that it can't continue at
all, and so abort the job.

Troy Telford