On Feb 5, 2010, at 6:40 PM, Gene Cooperman wrote:
> You're correct that we take a virtualized approach by intercepting network
> calls, etc. However, we purposely never intercept any frequently
> called system calls. So, for example, we never intercept a call
> to read() or to write() in TCP/IP, as part of our core design principles.
> Instead, we use things like the proc filesystem and the use of system calls
> to find the offset in an open file descriptor.
> We would love the opportunity to work with you on a demonstration for
> the high-performance networks that you mention. Can you suggest an
> MPI code and the appropriate hardware testbed on which we could get an
> account and run?
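For anyone following along, the "find the offset in an open file descriptor" idea quoted above can be sketched in a few lines of Python. This is only an illustration and not DMTCP's code: the helper names are made up, and the `/proc/self/fdinfo/<fd>` route is Linux-specific (the sketch returns None where that file is unavailable).

```python
import os


def offset_via_lseek(fd):
    """Ask the kernel for the current offset without moving it."""
    # Seeking 0 bytes relative to the current position returns that position.
    return os.lseek(fd, 0, os.SEEK_CUR)


def offset_via_proc(fd):
    """Read the offset from the proc filesystem (Linux only).

    Returns None if /proc/self/fdinfo/<fd> cannot be read.
    """
    try:
        with open(f"/proc/self/fdinfo/{fd}") as info:
            for line in info:
                # The file contains a line of the form "pos:<tab><decimal offset>".
                if line.startswith("pos:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None
```

Neither helper wraps read() or write(): both look at the descriptor from the outside, which is the point made in the quoted text.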
Any MPI code should do -- even something as simple as a pass-the-message-around-in-a-ring app. If you can checkpoint and restart it, that's a good start.
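To make the ring idea concrete, here is a rough sketch of its logic in plain Python. It is not MPI code: threads and queues stand in for MPI ranks and for the send/receive calls, and the function name `run_ring` is made up for this illustration. A real checkpoint/restart test would do the same thing with an MPI library.

```python
import queue
import threading


def run_ring(num_ranks, laps=1):
    """Pass a hop counter around a ring of workers and return its final value."""
    # One inbox per simulated rank; rank r sends to rank (r + 1) % num_ranks.
    inboxes = [queue.Queue() for _ in range(num_ranks)]
    final = []

    def worker(rank):
        mine = inboxes[rank]
        right = inboxes[(rank + 1) % num_ranks]
        if rank == 0:
            token = 0
            for _ in range(laps):
                right.put(token + 1)   # start (or continue) a lap
                token = mine.get()     # wait for the token to come back
            final.append(token)
        else:
            for _ in range(laps):
                token = mine.get()     # receive from the left neighbor
                right.put(token + 1)   # forward to the right neighbor

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return final[0]
```

Every forward adds one hop, so the returned count is `laps * num_ranks`. Having a known final value like this makes it easy to check that a job resumed correctly after a restart.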
As for high-speed networks, any iWARP, IB, or Myrinet based network should do. iWARP+IB use the OpenFabrics verbs API; Myrinet networks use the MX API. AFAIK, neither of them exports counters through /sys or /proc.
> We are aware of your plugin facilities and that in addition to BLCR,
> other checkpointers can also integrate with it. And of course, we have the
> highest respect for BLCR. We think that at this time, it is best to
> continue exploring both approaches.
One clarification -- our plugin interfaces were not designed specifically to support BLCR. They were designed to support generic checkpointing facilities. Of course, we only had a few in mind when they were designed, so it's possible that they might need to be extended if yours differs from the general model that we envisioned. But all things are do-able.
I just mention this if you wish to pursue the inside-Open-MPI plugins approach. Of course, staying outside of Open MPI is advantageous from a portability point of view.
> Although we haven't looked so closely at the plugin facility, we had
> assumed that it always interoperates with the OpenMPI checkpoint-restart
> service developed by Joshua Hursey (for which we also have very
> high respect).
Ya, he's a smart guy. But don't say it too loud or he'll get a big ego! ;-)
> Our understanding was that the OpenMPI checkpoint-restart
> service arranges to halt all MPI messages, and then it calls BLCR
> for checkpointing on the local host.
The short answer is "yes". The longer answer is that Josh designed a few different types of plugins -- some for quiescing a job, some for back-end checkpointer support, etc. Hence, one is not directly dependent on the other. I believe that he has written some papers about this... ah, here's one of them (you may have seen this already?):
> DMTCP tries to do the job of both Josh's checkpoint-restart service
> and also BLCR, and it does it all transparently by working at the TCP/IP socket
> level. So, we simply run:
> dmtcp_checkpoint mpirun ./hello_mpi
> dmtcp_command --checkpoint
> (The file QUICK-START in DMTCP has a few more details.)
> Hence, we don't use the OpenMPI checkpoint-restart service or its plugin
> interface, since we're already able to do the distributed checkpointing
> directly. If it were important, we could modify DMTCP to be called
> by the plugin, and to do checkpointing only on the local host.
I guess that's what I was asking about -- if you thought it would be worthwhile to do that: have your checkpoint service be called by Open MPI. In this way, you'd use Open MPI's infrastructure to invoke your single-process-checkpointer underneath.
I'm guessing there are advantages and disadvantages to both.
> Also, as a side comment, DMTCP was already working with OpenMPI 1.2, but then
> later versions of OpenMPI started using more sophisticated system calls.
> By then, we were already working through different tasks, and it has
> taken us this long to come back to OpenMPI and properly support it again
> through our virtualized approach (properly handling the multiple
> ptys of OpenMPI, etc.).
Gotcha. FWIW, we don't checkpoint the run-time system in Open MPI -- we only checkpoint the MPI processes. Upon restart, we rebuild the run-time system and then launch the "restart" phase in the MPI processes. In this way, we avoided a lot of special case code and took advantage of much of the infrastructure that we already had.
This could probably be construed as an advantage to working in the plugin system of Open MPI -- you'd pretty much be isolated from using more complex system calls, etc. Indeed, that was one of Josh's primary design goals: separate the "quiesce" phase from the "checkpoint" phase because they really are two different things, and they don't necessarily have to be related.
That being said, a disadvantage of this approach is that the plugin work (i.e., the plugin itself -- not the actual back-end checkpointer) then becomes specifically tied to Open MPI. What we did with BLCR was to write a thin plugin that simply links to BLCR, where the majority of the work is contained. Hence, the plugin was pretty small -- it just interfaces to external functionality (many of Open MPI's plugins do that -- OMPI/MPI-specific logic is in the plugin, but we link against external libraries for additional functionality). Also, when working in our plugin system, you're using our model and infrastructure -- not your own.
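As a rough picture of the split described above (a "quiesce" phase kept separate from a thin back-end "checkpoint" plugin), here is a small Python sketch. All class and function names are hypothetical and chosen only for illustration; they are not Open MPI's actual plugin interfaces, and the back end here is a stand-in that merely records what it was asked to do.

```python
from abc import ABC, abstractmethod


class Quiescer(ABC):
    """Hypothetical interface: bring the job's communication to a halt and back."""

    @abstractmethod
    def quiesce(self, ranks): ...

    @abstractmethod
    def resume(self, ranks): ...


class LocalCheckpointer(ABC):
    """Hypothetical interface: checkpoint one process on the local host."""

    @abstractmethod
    def checkpoint(self, rank): ...


class RecordingQuiescer(Quiescer):
    def __init__(self, log):
        self.log = log

    def quiesce(self, ranks):
        self.log.append("quiesce")

    def resume(self, ranks):
        self.log.append("resume")


class FakeBackend(LocalCheckpointer):
    """Thin wrapper; a real one would call into an external checkpoint library."""

    def __init__(self, log):
        self.log = log

    def checkpoint(self, rank):
        self.log.append(f"checkpoint rank {rank}")
        return f"ckpt.{rank}"  # made-up image name


def checkpoint_job(ranks, quiescer, backend):
    """Quiesce first, then checkpoint each process, then always resume."""
    quiescer.quiesce(ranks)
    try:
        return [backend.checkpoint(r) for r in ranks]
    finally:
        quiescer.resume(ranks)
```

Because `checkpoint_job` only sees the two interfaces, either side can be swapped without touching the other, which is the independence described above.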
Josh had a lot of freedom to design our model and is finishing his PhD because of it :-), but he definitely had the "first implementor" advantage. While we certainly encourage (and want!) new and novel work, the onus is now on new proposers to show why their system would be better than the one we have, etc.
> So, in conclusion, DMTCP will work fine with OpenMPI out of the box
> for small and medium jobs. For questions of scalable computation,
> measuring overhead, and so on, we would need a partner to address those
> issues. We would do most of the work, but we would need someone more
> intimately familiar with good testbeds for OpenMPI, prioritized goals
> for OpenMPI, and so on, in order to help lead us through the challenge
> of scalability. If that partner recommends that we would integrate
> best through the OpenMPI plugin, we can certainly do that. In fact, we are
> working right now with the Condor group to have them validate DMTCP
> as a checkpointer (initially for their vanilla universe) by operating
> through the Condor checkpoint interface.
Nifty. If we want to have more detailed conversations, a phone call is likely best. Ping me off-list and we can set up a time.
Little known fact: one of the primary communication tools between Open MPI developers is the telephone (!). We all email and IM each other frequently, but you can save a week's worth of exhausting emails with a 30- or 60-minute phone conversation. :-)