Leonardo Fialho wrote:
> NETDEV WATCHDOG: eth0: transmit timed out
> tg3: eth0: transmit timed out, resetting
> tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
> tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
> tg3: eth0: Link is down.
> tg3: eth0: Link is up at 1000 Mbps, full duplex.
The tg3 driver reports a timeout because the transmit path is stuck. That
can be caused by an interrupt problem or by bad hardware flow control on
the switch. Since traffic resumes after the driver resets the link, it
looks like either the switch's flow control is broken (try turning it
off, or test with two nodes connected back-to-back) or another node has
stopped consuming data.
Open MPI may generate enough contention to trigger the problem, but I
don't think the problem is directly related to Open MPI.
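
To check the NIC side of the flow-control suggestion above, here is a
minimal Python sketch. It assumes the usual "Autonegotiate/RX/TX: on|off"
layout printed by `ethtool -a`, and that the interface is eth0 as in the
log. The function names are hypothetical helpers, not part of any tool.
It only builds the command that would disable pause frames; running it
needs root, and the switch port still has to be configured separately on
the switch itself.

```python
import subprocess


def parse_pause_settings(ethtool_output: str) -> dict:
    """Parse the text printed by `ethtool -a <iface>` into booleans.

    Assumes lines of the form "RX:  on"; any other line is ignored.
    """
    settings = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in ("autonegotiate", "rx", "tx"):
            settings[key] = value.strip().lower() == "on"
    return settings


def read_pause_settings(iface: str = "eth0") -> dict:
    """Query the current pause (flow-control) settings of one interface."""
    result = subprocess.run(
        ["ethtool", "-a", iface],
        capture_output=True, text=True, check=True,
    )
    return parse_pause_settings(result.stdout)


def disable_pause_command(iface: str = "eth0") -> list:
    """Build (but do not run) the ethtool command that turns pause off."""
    return ["ethtool", "-A", iface, "autoneg", "off", "rx", "off", "tx", "off"]
```

If read_pause_settings("eth0") shows rx or tx enabled, running the
command from disable_pause_command("eth0") as root and repeating the MPI
job is one way to see whether flow control is involved.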