I have not tried to support an MTL with the checkpointing functionality, so I do not have first-hand experience with those - just the OB1/BML/BTL stack.
The difficulty in porting to a new transport is really a function of how the transport interacts with the checkpointer (e.g., BLCR). The draining logic is handled above the PML level (in the CRCP framework), so the MTL would only have to implement a ft_event() handler. The ft_event() handler needs to (1) prepare the transport for checkpointing (the channel is known to be clear at this point, but you may have to handle registered memory and things like that), (2) continue operation after a checkpoint in the same process image, and (3) restart the transport on recovery into a new process image (usually something like reinitializing the driver).
The easiest way to implement these is to shut down the driver on checkpoint prep (something like a finalize function) and reinitialize it on the continue/restart phases (something like an init function). Depending on the transport driver you might be able to do something better (like we do for tcp and sm), but it is really transport driver specific.
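As a rough sketch of that finalize/reinitialize approach, something like the following could work. Every name here (the ft_state enum, struct transport, transport_init(), transport_finalize(), transport_ft_event()) is made up for illustration and is not the actual Open MPI or PSM interface; the "driver" is just a flag and two counters, so that the control flow can be followed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the phases an ft_event() handler sees. */
enum ft_state { FT_CHECKPOINT, FT_CONTINUE, FT_RESTART, FT_TERM };

/* Hypothetical transport state; a real MTL would hold driver handles,
 * endpoints, registered memory, and so on. */
struct transport {
    bool driver_open;
    int  init_count;
    int  fini_count;
};

/* Bring the (pretend) driver up. Safe to call if it is already up. */
int transport_init(struct transport *t)
{
    if (t->driver_open) {
        return 0;
    }
    t->driver_open = true;
    t->init_count++;
    return 0;
}

/* Tear the (pretend) driver down so that no driver state ends up in
 * the checkpoint image. Safe to call if it is already down. */
int transport_finalize(struct transport *t)
{
    if (!t->driver_open) {
        return 0;
    }
    t->driver_open = false;
    t->fini_count++;
    return 0;
}

/* Simplest possible handler: finalize on checkpoint prep, then
 * reinitialize on both continue (same process image) and restart
 * (new process image). */
int transport_ft_event(struct transport *t, enum ft_state state)
{
    switch (state) {
    case FT_CHECKPOINT:
        /* The channel is assumed to be drained already. */
        return transport_finalize(t);
    case FT_CONTINUE:
    case FT_RESTART:
        return transport_init(t);
    case FT_TERM:
        /* Nothing to do in this sketch. */
        return 0;
    }
    return -1; /* unknown state */
}
```

Because the driver is closed before the image is taken, the restart path sees driver_open == false and does a plain init, which is why continue and restart can share the same code here. A transport that can do better (keep connections or registrations across a continue, for example) would split those two cases.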
If you decide to dig into this, let me know how it goes and if I can be of further help.
On Thu, Jan 12, 2012 at 8:16 AM, Dave Love <email@example.com> wrote:
What would be involved in adding checkpointing to other transports,
specifically the PSM MTL? Are there (likely to be?) technical
obstacles, and would it be a lot of work if not? I'm asking in case it
would be easy, and we don't have to exclude QLogic from a procurement,
given they won't respond about open-mpi support.
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey