Open MPI User's Mailing List Archives

From: Durga Choudhury (dpchoudh_at_[hidden])
Date: 2006-10-26 18:15:55

As an alternative suggestion (although George's is better, since this one
will affect your entire network connectivity), you could override the
default TCP timeout values with the "sysctl -w" command.
The following three sysctl parameters affect TCP keepalive timeout
behavior under Linux:

net.ipv4.tcp_keepalive_intvl = 75
    <----- How often (in seconds) to send keepalive probes.
net.ipv4.tcp_keepalive_probes = 9
    <----- How many probes to send before declaring the connection dead.
net.ipv4.tcp_keepalive_time = 7200
    <----- How long the connection may be idle before the first keepalive
           is sent.

Again, use them with caution and not on a live internet server.
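
To get a feel for what these three values add up to, here is a minimal Python sketch that reads them from /proc and estimates how long an idle connection to a dead peer survives. The function names and the fallback table are mine, purely for illustration. The estimate assumes the usual keepalive sequence: idle time first, then one interval per unanswered probe. Note that keepalive only applies to sockets that have SO_KEEPALIVE enabled; whether Open MPI's TCP connections do so is not stated in this thread, so treat that as an assumption to check.

```python
from pathlib import Path

# Values quoted in the message above; used as a fallback when /proc is unreadable.
QUOTED_DEFAULTS = {
    "tcp_keepalive_time": 7200,
    "tcp_keepalive_intvl": 75,
    "tcp_keepalive_probes": 9,
}


def read_keepalive_settings(proc_dir="/proc/sys/net/ipv4"):
    """Read the three keepalive settings (Linux only), falling back to QUOTED_DEFAULTS."""
    settings = {}
    for name, fallback in QUOTED_DEFAULTS.items():
        try:
            settings[name] = int(Path(proc_dir, name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = fallback
    return settings


def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Idle time before the first probe, plus one interval per unanswered probe."""
    return time_s + intvl_s * probes


if __name__ == "__main__":
    s = read_keepalive_settings()
    total = keepalive_detection_seconds(
        s["tcp_keepalive_time"], s["tcp_keepalive_intvl"], s["tcp_keepalive_probes"]
    )
    print(f"dead peer detected after roughly {total} s ({total / 60:.1f} min)")
```

With the values listed above this comes to 7200 + 75 * 9 = 7875 seconds, a little over two hours, which would fit with the job appearing to pause indefinitely on Ethernet.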


On 10/26/06, George Bosilca <bosilca_at_[hidden]> wrote:
> The Open MPI behavior is the same regardless of the network used
> for the job, at least the behavior dictated by our internal message
> passing layer. But for this to happen we have to get a warning from
> the network that something is wrong (such as a timeout). In the case
> of TCP (and Myrinet) the timeout is so high that Open MPI was not
> informed that something went wrong (we print out some warnings when
> this happens). It was happily waiting for a message to complete ...
> Once the network cable was reconnected, the network device itself
> recovered and resumed the communication, leading to a correct send
> operation (and this without involving Open MPI at all). There is
> nothing (that has a reasonable cost) we can do about this.
> For IB, it looks like the network timeout is smaller. Open MPI knew
> that something was wrong (the output proves it), and tried to continue
> using the other available devices. If none are available, then Open
> MPI is supposed to abort the job. For your particular run, did you
> have Ethernet between the nodes? If yes, I'm quite sure the MPI run
> wasn't stopped ... it continued using the TCP device (if not disabled
> by hand at mpirun time).
> That's not what is supposed to happen right now. If there are other
> devices (such as TCP) the MPI job will print out some warnings and
> will continue over the remaining networks (some will continue to use
> the other networks; only the peer where the network went down gets
> affected). If the network timeout is too high, Open MPI will never
> notice that something went wrong. At least not the default message
> layer (PML).
> If you want to have the job abort when your main network goes down,
> disable the use of the other available networks. More specifically,
> disable TCP. A simple way to do this is to add the following
> argument to your mpirun command:
> --mca btl ^tcp (or --mca btl openib,sm,self).
> Thanks,
> george.
> PS: There are several internal message passing modules available for
> Open MPI. The default one favors performance over reliability. If
> reliability is what you need, you should use the DR PML. For this,
> you can specify --mca pml dr at mpirun time. This (DR) PML has data
> reliability and timeouts (Open MPI internal timeouts that are
> configurable), allowing faster recovery from a network failure.
> On Oct 26, 2006, at 3:52 PM, Troy Telford wrote:
> > I've recently had the chance to see how Open MPI (as well as other
> > MPIs) behave in the case of network failure.
> >
> > I've looked at what happens when a node has its network connection
> > disconnected in the middle of a job, with Ethernet, Myrinet (GM), and
> > InfiniBand (OpenIB).
> >
> > With Ethernet and Myrinet, the job more or less pauses until the
> > cable is re-connected. (I imagine timeouts still apply, but I wasn't
> > patient enough to wait for them)
> >
> > With InfiniBand, the job pauses and Open MPI throws a few error
> > messages. After the cable is plugged back in (and the SM catches up),
> > the job remains where it was when it was paused. I'd guess that part
> > of this is that the timeout is much shorter with IB than with Myri or
> > Ethernet, and that when I unplug the IB cable, it times out fairly
> > quickly (and then Open MPI throws its error messages).
> >
> > At any rate, the thought occurs (and it may just be my ignorance of
> > MPI): After a network connection times out (as was apparently the
> > case with IB), is the job salvageable? If the jobs are not
> > salvageable, why didn't Open MPI abort the job (and clean up the
> > running processes on the nodes)?
> > --
> > Troy Telford
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> >

Devil wanted omnipresence;
He therefore created communists.