Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running on crashing nodes
From: Randolph Pullen (randolph_pullen_at_[hidden])
Date: 2010-09-27 22:53:15


I have successfully used a Perl program to start mpirun and record its PID. The monitor can then watch the output from MPI and terminate the mpirun command (with a series of kills, for example) if it is having trouble.

One way to do this is to prefix all legal output from your MPI program with a known short string. If the monitor sees a line without this prefix, it can terminate MPI, check which nodes are available, and recast the job accordingly.
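
As a rough illustration of the monitor idea described above, here is a minimal sketch in Python rather than Perl. It is not the original program. The prefix "OK: ", the function names, the is_alive check and the ./my_app binary are all assumptions made for the example, and the sketch assumes that sending SIGTERM to mpirun is enough to bring the job down.

```python
import subprocess

# Hypothetical marker that the MPI program prints at the start of every legal line.
PREFIX = "OK: "


def run_monitored(cmd, prefix=PREFIX, grace=5):
    """Run cmd and watch its output line by line.

    Returns (ok, lines). ok is True only if every line carried the prefix
    and the command exited with status 0. lines holds the payload of each
    good line with the prefix stripped.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    clean = True
    for raw in proc.stdout:
        line = raw.rstrip("\n")
        if line.startswith(prefix):
            lines.append(line[len(prefix):])
            continue
        # Unexpected output: treat it as trouble and stop the launcher.
        clean = False
        proc.terminate()              # polite request first (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()               # escalate if it does not exit in time
        break
    proc.wait()
    proc.stdout.close()
    return clean and proc.returncode == 0, lines


def run_until_clean(build_cmd, hosts, is_alive, max_attempts=3):
    """Relaunch the job on the hosts that currently pass is_alive.

    build_cmd(hosts) must return the argv list to execute, for example
    ["mpirun", "-np", str(len(hosts)), "--host", ",".join(hosts), "./my_app"]
    where ./my_app is a placeholder. is_alive(host) is a caller-supplied
    check (ping, ssh, a scheduler query, ...). Returns the output lines of
    the first clean run, or None if no attempt succeeded.
    """
    for _ in range(max_attempts):
        usable = [h for h in hosts if is_alive(h)]
        if not usable:
            break
        ok, lines = run_monitored(build_cmd(usable))
        if ok:
            return lines
    return None
```

Note that this only reacts to unexpected output or a non-zero exit status. A job that hangs silently would need an additional timeout on the read, which is left out here to keep the sketch short.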
Hope this helps,
Randolph
--- On Fri, 24/9/10, Joshua Hursey <jjhursey_at_[hidden]> wrote:

From: Joshua Hursey <jjhursey_at_[hidden]>
Subject: Re: [OMPI users] Running on crashing nodes
To: "Open MPI Users" <users_at_[hidden]>
Received: Friday, 24 September, 2010, 10:18 PM

As one of the Open MPI developers actively working on the MPI layer stabilization/recovery feature set, I don't think we can give you a specific timeframe for availability, especially availability in a stable release. Once the initial functionality is finished, we will open it up for user testing by making a public branch available. After addressing the concerns highlighted by public testing, we will attempt to work this feature into the mainline trunk and an eventual release.

Unfortunately it is difficult to assess the time needed to go through these development stages. What I can tell you is that the work to this point on the MPI layer is looking promising, and that as soon as we feel that the code is ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the stable version? A rough estimate will be fine.
>
>
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain <rhc_at_[hidden]> wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job if it had processes on that node. There is no current ability to "ride-thru" such an event.
>
> That said, there is work being done to support "ride-thru". Most of that is in the current developer's code trunk, and more is coming, but I wouldn't consider it production-quality just yet.
>
> Specifically, the code that does what you specify below is done and works. It is recovery of the MPI job itself (collectives, lost messages, etc.) that remains to be completed.
>
>
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <andrei.fokau_at_[hidden]> wrote:
> Dear users,
>
> Our cluster has a number of nodes that are likely to crash, so calculations often stop because one node goes down. Do you know whether it is possible to exclude crashed nodes at run time when running with OpenMPI? I am asking whether such behavior can be programmed in principle. Does OpenMPI allow such dynamic checking? The scheme I have in mind is the following:
>
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some moment one node goes down
> 3. The code realizes that the node is down (the results are lost) and excludes it from the list of nodes to run its tasks on
> 4. At a later moment the user restarts the crashed node
> 5. The code notices that the node is up again, and puts it back on the list of active nodes
>
>
> Regards,
> Andrei
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey
