Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-10-26 18:24:49

Moreover ... you need admin rights in order to modify
these parameters. If you have them, there is a trick for MX too: you
can recompile it with a different timeout (recompilation is required,
as far as I remember). Grep for "timeout" in the MX sources and you
will find out how to do it. If you choose this path, be cautious.
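
To get a feel for why the TCP timeout is "so high", here is a rough sketch of the arithmetic using the default keepalive values quoted further down in this thread. It assumes keepalive is actually enabled on the socket (the SO_KEEPALIVE option), which this thread does not confirm for Open MPI's TCP connections. The replacement values in the trailing comments are purely hypothetical.

```shell
#!/bin/sh
# Linux defaults as quoted in this thread.
keepalive_time=7200    # idle seconds before the first keepalive probe
keepalive_intvl=75     # seconds between probes
keepalive_probes=9     # unanswered probes before the connection is declared dead

# Worst case for an idle connection: wait for the first probe,
# then for every probe to go unanswered.
detect_seconds=$(( keepalive_time + keepalive_intvl * keepalive_probes ))
echo "idle connection declared dead after ${detect_seconds} s"

# Shortening the timeout needs root and affects every TCP connection
# on the host. Hypothetical values, for illustration only:
#   sysctl -w net.ipv4.tcp_keepalive_time=60
#   sysctl -w net.ipv4.tcp_keepalive_intvl=10
#   sysctl -w net.ipv4.tcp_keepalive_probes=5
```

With the defaults this comes to 7875 seconds, a bit over two hours, which is consistent with nobody in the thread having waited long enough to see a TCP timeout.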

If you don't want to alter these defaults for TCP and MX, or if you
don't have admin rights, there is one and only one solution
possible ... the DR PML, as explained in my previous email.
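
For reference, the mpirun options mentioned in this thread can be sketched as a dry run that only prints the command lines. The process count and the ./my_app binary are placeholders, not something from the original posts, and the flags are taken from the quoted email as it stood in 2006; check that your Open MPI build still provides these components before relying on them.

```shell
#!/bin/sh
# Placeholders: adjust the process count and application to your job.
np=4
app=./my_app

# Abort when the main network goes down: exclude the tcp BTL so
# Open MPI has no TCP device to fall back to.
no_tcp_cmd="mpirun -np $np --mca btl ^tcp $app"

# Alternative: list the allowed BTLs explicitly.
explicit_cmd="mpirun -np $np --mca btl openib,sm,self $app"

# Favor reliability: select the DR PML, which has its own
# configurable timeouts.
dr_cmd="mpirun -np $np --mca pml dr $app"

# Dry run: print the commands instead of launching anything.
printf '%s\n' "$no_tcp_cmd" "$explicit_cmd" "$dr_cmd"
```

Quote the ^tcp argument if your interactive shell treats ^ specially.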


On Oct 26, 2006, at 6:15 PM, Durga Choudhury wrote:

> As an alternate suggestion (although George's is better, since this
> will affect your entire network connectivity), you could override
> the default TCP timeout values with the "sysctl -w" command.
> The following three OIDs affect TCP timeout behavior under Linux:
> net.ipv4.tcp_keepalive_intvl = 75 <----- How often (in seconds) to
> send keepalive probes
> net.ipv4.tcp_keepalive_probes = 9 <----- How many probes to send
> before declaring the connection dead.
> net.ipv4.tcp_keepalive_time = 7200 <----- How long the connection
> may be idle before the first keepalive is sent.
> Again, use them with caution and not on a live internet server.
> Durga
> On 10/26/06, George Bosilca <bosilca_at_[hidden]> wrote:
> The Open MPI behavior is the same regardless of the network used
> for the job. At least the behavior dictated by our internal message
> passing layer. But for this to happen, we have to get a warning from
> the network that something is wrong (such as a timeout). In the case of
> TCP (and Myrinet) the timeout is so high that Open MPI was not
> informed that something went wrong (we print out some warnings when
> this happens). It was happily waiting for a message to complete ...
> Once the network cable was reconnected, the network device itself
> recovered and resumed the communication, leading to a correct send
> operation (and this without involving Open MPI at all). There is
> nothing (that has a reasonable cost) we can do about this.
> For IB, it looks like the network timeout is smaller. Open MPI knew that
> something was wrong (the output proves it), and tried to continue
> using the other available devices. If none are available, then Open
> MPI is supposed to abort the job. For your particular run, did you have
> Ethernet between the nodes? If yes, I'm quite sure the MPI run
> wasn't stopped ... it continued using the TCP device (if not disabled
> by hand at mpirun time).
> That's not what is supposed to happen right now. If there are other
> devices (such as TCP) the MPI job will print out some warnings and
> will continue over the remaining networks (some will continue to use
> the other networks, only the peer where the network went down is
> affected). If the network timeout is too high, Open MPI will never
> notice that something went wrong. At least not the default message
> layer (PML).
> If you want the job to abort when your main network goes down,
> disable the use of the other available networks. More specifically,
> disable TCP. A simple way to do it is to add the following
> argument to your mpirun command:
> --mca btl ^tcp (or --mca btl openib,sm,self).
> Thanks,
> george.
> PS: There are several internal message passing modules available for
> Open MPI. The default one favors performance over
> reliability. If reliability is what you need, you should use the DR
> PML. For this, you can specify --mca pml dr at mpirun time. This (DR)
> PML has data reliability and timeouts (Open MPI internal timeouts that
> are configurable), allowing it to recover faster from a network failure.
> On Oct 26, 2006, at 3:52 PM, Troy Telford wrote:
> > I've recently had the chance to see how Open MPI (as well as other
> > MPIs) behave in the case of network failure.
> >
> > I've looked at what happens when a node has its network connection
> > disconnected in the middle of a job, with Ethernet, Myrinet (GM),
> > and InfiniBand (OpenIB).
> >
> > With Ethernet and Myrinet, the job more or less pauses until the
> > cable is re-connected. (I imagine timeouts still apply, but I wasn't
> > patient enough to wait for them)
> >
> > With InfiniBand, the job pauses and Open MPI throws a few error
> > messages. After the cable is plugged back in (and the SM catches up),
> > the job remains where it was when it was paused. I'd guess that part
> > of this is that the timeout is much shorter with IB than with Myri or
> > Ethernet, and that when I unplug the IB cable, it times out fairly
> > quickly (and then Open MPI throws its error messages).
> >
> > At any rate, the thought occurs (and it may just be my ignorance of
> > MPI): After a network connection times out (as was apparently the
> > case with IB), is the job salvageable? If the jobs are not
> > salvageable, why didn't Open MPI abort the job (and clean up the
> > running processes on the nodes)?
> > --
> > Troy Telford
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> >
> --
> Devil wanted omnipresence;
> He therefore created communists.