Open MPI User's Mailing List Archives

From: Gleb Natapov (glebn_at_[hidden])
Date: 2006-10-31 09:51:16


On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
> On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov <glebn_at_[hidden]>
> wrote:
>
> > If you use the OB1 PML (the default), it will never recover from a
> > link-down error, no matter how many other transports you have. The
> > reason is that OB1 never tracks what happens to buffers submitted to a
> > BTL. So if a BTL can't, for any reason, transmit a packet passed to it
> > by OB1, the job will get stuck, because OB1 doesn't have enough
> > information to try to resend the packet via another transport. The DR
> > PML exists for this kind of resource tracking. In the case of the IB
> > BTL, a link-down event generates an error for each packet submitted for
> > sending to the device. The IB BTL simply discards all these packets and
> > relies on the PML to resend them, so even after a link-up event a job
> > will not recover if the OB1 PML is used with the IB BTL. This may be
> > different with other transports.
>
> This makes sense; one thing I'm wondering now is whether the OB1 PML is
> able (and/or has enough information) to figure out that it can't
> continue at all, and will abort the job.

In the case of the openib BTL, I don't see how a job could recover from a
link-down event, so I think aborting the job is the right thing to do. For
other transports, if the transport can continue after a link-up event as
if nothing happened, it is worth waiting for the link to come up. This
capability could be added to the openib BTL too; it's just that nobody has
cared enough to do it.
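
To make the resource-tracking difference from the quoted text concrete,
here is a toy Python sketch. It is not Open MPI code: Transport,
FireAndForgetLayer and TrackingLayer are hypothetical names made up for
illustration, and the real PMLs involve far more than this. It only shows
why a layer that forgets a packet once it is handed to a transport cannot
fail over, while a layer that keeps the packet until some transport
accepts it can.

```python
class Transport:
    """Hypothetical stand-in for a BTL; send() fails while the link is down."""

    def __init__(self, name, link_up=True):
        self.name = name
        self.link_up = link_up
        self.delivered = []

    def send(self, packet):
        if not self.link_up:
            return False  # packet is simply discarded on link down
        self.delivered.append(packet)
        return True


class FireAndForgetLayer:
    """Does not track submitted packets (the behaviour described for OB1)."""

    def __init__(self, transports):
        self.transports = transports

    def send(self, packet):
        # Hand the packet to the first transport and forget about it;
        # the result is ignored, so a failure can never be retried.
        self.transports[0].send(packet)


class TrackingLayer:
    """Keeps each packet until a transport accepts it (assumed DR-like idea)."""

    def __init__(self, transports):
        self.transports = transports
        self.pending = []

    def send(self, packet):
        self.pending.append(packet)
        self.flush()

    def flush(self):
        # Try every transport in order; keep whatever nobody could send.
        still_pending = []
        for packet in self.pending:
            if not any(t.send(packet) for t in self.transports):
                still_pending.append(packet)
        self.pending = still_pending


def demo():
    # Untracked layer: IB link is down, TCP is fine, packet is lost anyway.
    ib, tcp = Transport("ib", link_up=False), Transport("tcp")
    FireAndForgetLayer([ib, tcp]).send("msg-1")
    lost = ib.delivered == [] and tcp.delivered == []

    # Tracking layer: same situation, packet fails over to TCP.
    ib2, tcp2 = Transport("ib", link_up=False), Transport("tcp")
    TrackingLayer([ib2, tcp2]).send("msg-1")
    return lost, tcp2.delivered
```

With only one transport, the tracking layer holds the packet in its
pending list and can resend it with flush() once the link is back up,
which is the "wait for link up" case mentioned above.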

--
			Gleb.