On Thu, Oct 26, 2006 at 05:39:13PM -0600, Troy Telford wrote:
> I'm also confident that both TCP & Myrinet would throw an error when they
> time out; it's just that I haven't felt the need to verify it. (And with
> some-odd 20 minutes for Myrinet, it takes a bit of attention span. The
> last time I tried it I had forgotten about it for about 3-4 hours).
> > If none are available, then Open
> > MPI is supposed to abort the job. For your particular run did you have
> > Ethernet between the nodes? If yes, I'm quite sure the MPI run
> > wasn't stopped ... it continued using the TCP device (if not disabled
> > by hand at mpirun time).
> This brings up an interesting question: The job was simply Intel's MPI
> benchmark (IMB), which is fairly chatty (ie. lots of screen output).
> On the first try, I used '--mca btl ^gm,^mx' to start the job. Ethernet
> was connected (eth0=10/100, eth1=gigabit), but after the IB cable was
> disconnected, everything stopped. The link lights (ethernet & IB) were
> not blinking, nor do any of the system monitors show much TCP traffic;
> certainly not the sort of traffic one would expect from an IMB run.
> I've also tried using --mca openib,sm,self,tcp (specifically adding TCP)
> and didn't see any sort of difference; the job still 'stuck' as soon as
> the IB cable was removed. I'll let that job continue to run overnight
> (ie. --mca btl tcp,openib,sm,self) to see if the job ever wakes up.
If you use the OB1 PML (the default one), the job will never recover from
a link-down error, no matter how many other transports you have. The
reason is that OB1 never tracks what happens to buffers submitted to a
BTL. So if a BTL can't, for any reason, transmit a packet passed to it by
OB1, the job will get stuck, because OB1 doesn't have enough information
to resend the packet via another transport. The DR PML exists for this
kind of resource tracking. In the case of the IB BTL, a link-down event
generates an error for each packet submitted to the device for sending.
The IB BTL simply discards all these packets and relies on the PML to
resend them, so even after a link-up event a job will not recover if the
OB1 PML is used with the IB BTL. This may be different with other
transports.
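
To make the difference concrete, here is a toy Python model of the two
behaviours. It is only a sketch, not Open MPI code: the class and function
names (Transport, ForgetfulLayer, TrackingLayer, run) are made up for
illustration, and the real PMLs and BTLs are far more involved. It assumes
a transport that drops packets when its link is down, an OB1-like layer
that keeps no record of what it handed over, and a DR-like layer that
keeps each packet until some transport accepts it.

```python
class Transport:
    """Toy stand-in for a BTL: delivers a packet, or drops it if the link is down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.delivered = []

    def send(self, packet):
        if self.up:
            self.delivered.append(packet)
            return True
        return False  # link down: the packet is discarded


class ForgetfulLayer:
    """OB1-like (hypothetical): hands the packet over and keeps no record of it."""

    def __init__(self, transports):
        self.transports = transports

    def send(self, packet):
        # The result is ignored, so there is nothing left to resend from.
        self.transports[0].send(packet)


class TrackingLayer:
    """DR-like (hypothetical): keeps each packet until a transport accepts it."""

    def __init__(self, transports):
        self.transports = transports
        self.pending = []

    def send(self, packet):
        self.pending.append(packet)
        self.flush()

    def flush(self):
        still_pending = []
        for packet in self.pending:
            # Try each transport in order; stop at the first that accepts.
            if not any(t.send(packet) for t in self.transports):
                still_pending.append(packet)
        self.pending = still_pending


def run(layer_cls):
    """Send three packets with the preferred link down; return what arrived."""
    ib = Transport("ib", up=False)   # cable pulled
    tcp = Transport("tcp")           # still connected
    layer = layer_cls([ib, tcp])
    for i in range(3):
        layer.send(i)
    return ib.delivered + tcp.delivered


print(run(ForgetfulLayer))  # nothing arrives
print(run(TrackingLayer))   # packets arrive over the second transport
```

In this model the forgetful layer loses every packet even though a working
second transport is available, which is the "stuck" behaviour described
above; the tracking layer can fail over only because it still holds the
packets.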