Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpointing on other transports
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2012-01-17 12:45:21

I have not tried to support an MTL with the checkpointing functionality, so
I do not have first-hand experience with those, only with the OB1/BML/BTL stack.

The difficulty in porting to a new transport is really a function of how
the transport interacts with the checkpointer (e.g., BLCR). The draining
logic is handled above the PML level (in the CRCP framework), so the MTL
would only have to implement a ft_event() handler. The ft_event() handler
needs to (1) prepare the transport for checkpointing (the channel is known
to be clear at this point, but you may have to handle registered memory and
things like that), (2) continue operation after a checkpoint in the same
process image, and (3) restart the transport on recovery into a new
process image (usually something like reinitializing the driver).
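
To make the three cases concrete, a handler along these lines is roughly
what I have in mind. This is only a sketch: the state names, the phase
tracking, and the function name example_mtl_ft_event() are made up for
illustration and are not the actual Open MPI symbols, and the comments mark
where the transport-specific work would go.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the states handed to an ft_event() handler. */
typedef enum {
    FT_STATE_CHECKPOINT,  /* about to take a checkpoint            */
    FT_STATE_CONTINUE,    /* checkpoint taken, same process image  */
    FT_STATE_RESTART      /* recovering into a new process image   */
} ft_state_t;

/* Illustrative bookkeeping so the effect of each call can be observed. */
typedef enum {
    PHASE_NONE,
    PHASE_PREPARED,
    PHASE_CONTINUED,
    PHASE_RESTARTED
} ft_phase_t;

ft_phase_t last_phase = PHASE_NONE;

/* Returns 0 on success and -1 for a state the handler does not know. */
int example_mtl_ft_event(int state)
{
    switch (state) {
    case FT_STATE_CHECKPOINT:
        /* (1) The channel is already drained by the layer above, so only
         * transport-local state (e.g., registered memory) is handled here. */
        last_phase = PHASE_PREPARED;
        return 0;
    case FT_STATE_CONTINUE:
        /* (2) Same process image: resume normal operation. */
        last_phase = PHASE_CONTINUED;
        return 0;
    case FT_STATE_RESTART:
        /* (3) New process image: typically reinitialize the driver. */
        last_phase = PHASE_RESTARTED;
        return 0;
    default:
        fprintf(stderr, "unknown checkpoint state %d\n", state);
        return -1;
    }
}
```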

The easiest way to implement these is to shut down the driver on checkpoint
prep (something like a finalize function) and reinitialize it on
continue/restart phases (something like an init function). Depending on the
transport driver you might be able to do something better (like we do for
tcp and sm), but it is really transport driver specific.
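
As a rough illustration of the shutdown-and-reinitialize approach, here is
a sketch against a mock driver. Everything in it is hypothetical
(mock_driver_t, mock_driver_init(), mock_driver_finalize(), the CKPT_*
states, and simple_ft_event()); a real transport would call its own
init/finalize routines instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checkpoint states; not the real Open MPI constants. */
typedef enum { CKPT_CHECKPOINT, CKPT_CONTINUE, CKPT_RESTART } ckpt_state_t;

/* Mock transport driver: just tracks whether it is up and how often
 * it has been initialized. */
typedef struct {
    bool initialized;
    int  init_count;
} mock_driver_t;

/* Bring the mock driver up; fails if it is already up. */
int mock_driver_init(mock_driver_t *drv)
{
    assert(drv != NULL);
    if (drv->initialized) {
        return -1;
    }
    drv->initialized = true;
    drv->init_count++;
    return 0;
}

/* Tear the mock driver down; fails if it is not up. */
int mock_driver_finalize(mock_driver_t *drv)
{
    assert(drv != NULL);
    if (!drv->initialized) {
        return -1;
    }
    drv->initialized = false;
    return 0;
}

/* Simplest strategy: finalize before the checkpoint, init afterwards. */
int simple_ft_event(mock_driver_t *drv, ckpt_state_t state)
{
    switch (state) {
    case CKPT_CHECKPOINT:
        return mock_driver_finalize(drv);
    case CKPT_CONTINUE:   /* same process image */
    case CKPT_RESTART:    /* new process image  */
        return mock_driver_init(drv);
    }
    return -1;
}
```

Because the driver is down when the checkpoint is taken, the saved image
holds no live driver state, so the continue and restart cases can share the
same init path.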

If you decide to dig into this, let me know how it goes and if I can be of
further help.

-- Josh

On Thu, Jan 12, 2012 at 8:16 AM, Dave Love <d.love_at_[hidden]> wrote:

> What would be involved in adding checkpointing to other transports,
> specifically the PSM MTL? Are there (likely to be?) technical
> obstacles, and would it be a lot of work if not? I'm asking in case it
> would be easy, and we don't have to exclude QLogic from a procurement,
> given they won't respond about open-mpi support.

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory