
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-06-07 12:14:14


On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland <wbland_at_[hidden]> wrote:

> To address your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained list
> simplifies things later.
>

Not really concerned - I was just noting we had done it a tad differently,
but nothing important.

>
> For example, during communication you need to attach the epoch to each of
> your messages so they can be tracked later. If a process dies while the
> message is in flight, or you need to cancel your communication, you need to
> be able to match each message to the corresponding epoch. If the epoch
> isn't in the process name, then you have to modify the message header for
> each type of message to include that information. Each process not only
> needs to know the current epoch from its own perspective, but also from
> the perspective of whoever is sending the message.
>

But the epoch is process-unique - i.e., it is the number of times that this
specific process has been started, which differs per proc since we don't
restart all the procs every time one fails. So if I look at the epoch of the
proc sending me a message, I really can't check it against my own value as
the comparison is meaningless. All I really can do is check to see if it
changed from the last time I heard from that proc, which would tell me that
the proc has been restarted in the interim.
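
(For illustration only - a receiver-side check along those lines might look
like the following; the last_seen_epoch[] per-peer cache and the message
structure are hypothetical, not part of the patch:)

    /* Compare the sender's epoch against the last one recorded for it;
     * a change means the peer was restarted in the interim. */
    if (msg->sender.epoch != last_seen_epoch[msg->sender.vpid]) {
        last_seen_epoch[msg->sender.vpid] = msg->sender.epoch;
        handle_restarted_peer(&msg->sender);
    }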

> This is also true for things like reporting failures. To prevent duplicate
> notifications you would need to include your epoch in all the notifications
> so no one marks a process as failed twice.
>

I'm not sure of the relevance here. We handle this without problem right now
(at least, within orcm - haven't looked inside orte yet to see what needs to
be brought back, if anything) without an epoch - and the state machine will
resolve the remaining race conditions, which really don't pertain to epoch
anyway.

>
> Really the point is that by changing the process name, you prevent the need
> to pack the epoch each time you have any sort of communication. All that
> work is done along with packing the rest of the structure.
>

No argument - I don't mind having the value in the name. Makes no difference
to me.

> On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:
>
> Thanks for the explanation - as I said, I won't have time to really review
> the patch this week, but appreciate the info. I don't really expect to see a
> conflict as George had discussed this with me previously.
>
> I know I'll have merge conflicts with my state machine branch, which would
> be ready for commit in the same time frame, but I'll hold off on that one
> and deal with the merge issues on my side.
>
>
>
> On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland <wbland_at_[hidden]> wrote:
>
> This could certainly work alongside another ORCM or any other fault
> detection/prediction/recovery mechanism. Most of the code is just dedicated
> to keeping the epoch up to date and tracking the status of the processes.
> The underlying idea was to provide a way for the application to decide what
> its fault policy would be rather than trying to dictate one in the runtime.
> If any other layer wanted to register a callback function with this code, it
> could do anything it wanted to on top of it.
>
> Wesley
>
> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>
> I'm on travel this week, but will look this over when I return. From the
> description, it sounds nearly identical to what we did in ORCM, so I expect
> there won't be many issues. You do get some race conditions that the new
> state machine code should help resolve.
>
> Only difference I can quickly see is that we chose not to modify the
> process name structure, keeping the "epoch" (we called it "incarnation") as
> a separate value. Since we aren't terribly concerned about backward
> compatibility, I don't consider this a significant issue - but something the
> community should recognize.
>
> My main concern will be to ensure that the new code contains enough
> flexibility to allow integration with other layers such as ORCM without
> creating potential conflict over "double protection" - i.e., if the layer
> above ORTE wants to provide a certain level of fault protection, then ORTE
> needs to get out of the way.
>
>
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>
> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemon) and application-level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it has died so far), and adds an application failure
> notification callback function to be registered in the runtime.
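>
> (A minimal sketch of the extended name structure - the jobid/vpid fields
> exist in ORTE today, while the epoch type shown here is illustrative:)
>
>     typedef uint32_t orte_epoch_t;   /* assumed width, for illustration */
>
>     struct orte_process_name_t {
>         orte_jobid_t jobid;   /* job the process belongs to */
>         orte_vpid_t  vpid;    /* rank of the process within the job */
>         orte_epoch_t epoch;   /* new: number of times this process has
>                                * been (re)started */
>     };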
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> ------
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault-tolerance
> development in Open MPI (with the exception of checkpoint/restart) is
> missing. However, before developing any fault-tolerance support at the
> application (MPI) level, we need a resilient runtime. The changes in
> this patch address this lack of support and would allow anyone to implement
> a fault-tolerance protocol at the MPI layer without having to worry about
> ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route
> all communications around the holes in order to __ALWAYS__ deliver a message
> as long as the destination process is alive. The application is informed
> (via a callback) about the loss of the processes with the same jobid. In
> this patch we do not address the MPI_ERRORS_RETURN type of failures; we
> focus on the MPI_ERRORS_ABORT ones. Moreover, we leave the decision to the
> application level, instead of making it in the runtime.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been
> detected to have terminated, either from a failure or an expected
> termination. After the termination is detected, the HNP coordinates all
> other processes' knowledge of the new epoch. Each ORTED will know the epoch
> of the other processes in the job, but it will not actually store anything
> until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another
> process has terminated, it repairs the routing layer and informs the HNP.
> The HNP tells all other processes about the failure so they can also repair
> their routing layers and update their internal bookkeeping. The processes do
> not abort after the termination is detected.
>
> Callback Function - When the HNP tells all the ORTEDs about the failures,
> they tell the ORTE layers within the applications. The application level
> ORTE layers have a callback function that they use to inform the OMPI layer
> about the error. Currently the OMPI errhandler code fills in this callback
> function so that it is informed when there is an error, and it aborts (to maintain
> the current default behavior of MPI). This callback function can also be
> used in an ORTE only application to perform application based fault
> tolerance (ABFT) and allow the application to continue.
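>
> (Sketched prototype for such a callback - the exact signature here is
> illustrative, not lifted from the patch:)
>
>     /* Invoked in the application's ORTE layer when the local daemon
>      * reports terminated processes; each name carries the failed epoch. */
>     typedef void (*orte_fault_callback_fn_t)(orte_process_name_t *failed_procs,
>                                              size_t num_failed);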
>
> NECESSARY FOR IMPLEMENTATION:
>
> Epoch - The orte_process_name_t struct now has a field for epoch. This
> means that whenever sending a message, the most current version of the epoch
> needs to be in this field. This is a simple look up using the function in
> orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c
> code, there is a check to make sure that it isn’t trying to send messages to
> a process that has already terminated (don’t send to a process with an epoch
> less than the current epoch). Make sure that if you are sending a message,
> you have the most up-to-date epoch here.
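>
> (A sketch of the lookup-before-send pattern just described; the exact
> signature of orte_util_lookup_epoch() is assumed here:)
>
>     orte_process_name_t peer;
>     peer.jobid = my_jobid;           /* illustrative values */
>     peer.vpid  = target_vpid;
>     /* refresh the epoch from the local view before every send */
>     peer.epoch = orte_util_lookup_epoch(&peer);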
>
> Routing - So far, only the binomial routing layer has been updated to use
> the new resilience features. To modify other routing layers to be able to
> continue running after a process failure, they need to be able to detect
> which processes are not currently running and route around them. The errmgr
> gives the routing layer two chances to do this. First it calls delete_route
> for each process that fails, then it calls update_routing_tree after it has
> appropriately marked each process. Before either of these things happen the
> epoch and process state have already been updated, so the routing layer can
> use this data to determine which processes are alive and which are dead. A
> convenience function has been added to orte/util/nidmap.h called
> orte_util_proc_is_running() which allows the ORTEDs to determine the status
> of a process. Keep in mind that a process is not running if it hasn’t
> started up yet so it is wise to check the epoch (to make sure that it isn’t
> ORTE_EPOCH_MIN) as well to make sure that you’re actually detecting an error
> and not just noticing that an ORTED hasn’t finished starting.
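>
> (A sketch combining the two checks just described; the exact signature of
> orte_util_proc_is_running() is assumed:)
>
>     if (!orte_util_proc_is_running(&peer)) {
>         if (peer.epoch == ORTE_EPOCH_MIN) {
>             /* the ORTED may simply not have finished starting yet */
>         } else {
>             /* genuine failure: mark it and route around this process */
>         }
>     }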
>
> Callback - If you want to implement some sort of fault tolerance on top of
> this code, use the callback function in the errmgr framework. There is a new
> function in the errmgr code called set_fault_callback that takes a function
> pointer. The ompi_init code sets this to a default value just after it calls
> orte_init (to make sure that there is an errmgr to call into). If you later
> set this to a new function, you will get the callback to notify you of
> process failures. Remember that you’ll need to handle any sort of MPI level
> fault tolerance at this point because you’ve taken away the callback for the
> OMPI layer.
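>
> (A sketch of registering a custom handler as described above; whether the
> function is reached through the orte_errmgr module struct, and the handler
> prototype itself, are assumptions for illustration:)
>
>     static void my_fault_handler(orte_process_name_t *failed_procs,
>                                  size_t num_failed)
>     {
>         /* application-level fault handling goes here; once this is set,
>          * the default OMPI abort-on-failure callback is replaced */
>     }
>
>     /* after orte_init()/ompi_init() have completed: */
>     orte_errmgr.set_fault_callback(my_fault_handler);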
>

_______________________________________________
devel mailing list
devel_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/devel