This is a bit late in the thread, but I wanted to add one more note.
The functionality that made it into v1.6 is fairly basic in terms of C/R
support in Open MPI. It supported a global checkpoint write and, for a
time, a simple staged option (I believe that one is now broken).
In the trunk (the work was all committed about 3 years ago) we extended
the support to give the user a bit more control over how the checkpoint
files are managed, in addition to other features such as automatic recovery,
process migration, and debugging support. These storage techniques let the
user request that a local tmp disk be used to stage a checkpoint. That way
BLCR writes to the local file system and the application continues running
while the checkpoint is being moved; Open MPI then stages it back to the
global file system (there were some quality-of-service controls and
compression options). The goal was to alleviate some of the load on the
network file system during the checkpoint burst, since we use a fully
coordinated approach. It helped quite a bit in the experiments I ran as
part of my PhD.
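To make the staging idea concrete, here is a rough, purely illustrative
sketch of the pattern in Python. It is not the actual Open MPI/BLCR
implementation; the paths and helper names are made up for illustration.

# Conceptual sketch of checkpoint staging (not Open MPI code).
# Idea: the checkpointer (e.g., BLCR) writes to fast local disk, the
# application resumes immediately, and a background thread drains the
# checkpoint to the shared file system. Paths below are hypothetical.
import shutil
import threading
from pathlib import Path

LOCAL_STAGE = Path("/tmp/ckpt")       # node-local scratch (hypothetical)
GLOBAL_STORE = Path("/global/ckpt")   # shared/parallel FS (hypothetical)

def write_local_checkpoint(rank: int, data: bytes) -> Path:
    """Stand-in for the checkpointer writing process state to local disk."""
    LOCAL_STAGE.mkdir(parents=True, exist_ok=True)
    path = LOCAL_STAGE / f"rank-{rank}.ckpt"
    path.write_bytes(data)
    return path

def drain_to_global(local_path: Path) -> None:
    """Copy the local snapshot to the global file system in the background,
    spreading out the I/O burst while the application keeps running."""
    GLOBAL_STORE.mkdir(parents=True, exist_ok=True)
    shutil.copy2(local_path, GLOBAL_STORE / local_path.name)
    local_path.unlink()               # reclaim local scratch space

def checkpoint(rank: int, data: bytes) -> threading.Thread:
    local_path = write_local_checkpoint(rank, data)
    t = threading.Thread(target=drain_to_global, args=(local_path,), daemon=True)
    t.start()                         # application continues while the copy proceeds
    return t

if __name__ == "__main__":
    t = checkpoint(rank=0, data=b"fake process image")
    # ... application keeps computing here ...
    t.join()                          # make sure the stage-out finished before exit

The point is simply that the expensive copy to the shared file system is
decoupled from the checkpoint itself, so the application only blocks for the
fast local write.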
Unfortunately, since that initial commit we have not been able to create a
release that includes those additional features. Most of the blame goes to
me for not having the resources to sustain support for them between
completing my PhD and now (as we start to prepare 1.7). So this has led
to the unfortunate, but realistic, situation where it will not be included
in 1.7 and is not available as a configuration option in the trunk (most of
the code is present, but it is known not to function correctly).
My hope is to bring the C/R support back in the future, but I cannot commit
to any specific date at this time. As an open-source project, we are always
looking for developers to help out. So if you (or anyone else on the list)
are interested in helping bring this support back, I am willing to advise
where I can.
On Wed, Jan 30, 2013 at 8:18 AM, Maxime Boissonneault <
> On 2013-01-29 21:02, Ralph Castain wrote:
> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault <
> maxime.boissonneault_at_[hidden]> wrote:
> While our filesystem and management nodes are on UPS, our compute nodes
> are not. With, on average, one generic (mostly power/cooling) failure every
> one or two months, running for weeks is just asking for trouble. Add to
> that typical DIMM/CPU/networking failures (I estimated about 1 node goes
> down per day because of some sort of hardware failure, for a cluster of 960
> nodes). With these numbers, a job running on 32 nodes for 7 days has a ~35%
> chance of failing before it is done.
> I've been running this in my head all day - it just doesn't fit my
> experience, which really bothered me. So I spent a little time running the
> calculation, and I came up with a much lower number (more like around 5%).
> I'm not saying my rough number is correct, but it is at least a little
> closer to what we see in the field.
> Given that there are a lot of assumptions required when doing these
> calculations, I would like to suggest you conduct a very simple and quick
> experiment before investing tons of time on FT solutions. All you have to
> do is:
> Thanks for the calculation. However, this is a cluster that I manage; I
> do not use it per se, and running such statistical jobs on a large part of
> the cluster for a long period of time is impossible. We do have the numbers,
> however. The cluster has 960 nodes. We experience roughly one power or
> cooling failure every month or two. Assuming one such failure per two
> months, if you run for 1 month, you have a 50% chance your job will be
> killed before it ends. If you run for 2 weeks, 25%, etc. These are very
> rough estimates obviously, but it is way more than 5%.
> In addition to that, we have a failure rate of ~0.1%/day, meaning that out
> of 960 nodes, on average, one node will have a hardware failure every day.
> Most of the time, this is a failure of one of the DIMMs. Considering each
> node has 12 DIMMs of 2 GB of memory, that implies a per-DIMM failure rate of
> ~0.0001 per day. I don't know if that's bad or not, but this is roughly what
> we have.
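(A quick aside from me on the numbers above, for anyone who wants to
sanity-check them: below is a rough back-of-the-envelope calculation. It is
only a sketch, assuming independent failures and a simple exponential model,
using the rates quoted above.)

# Rough sanity check of the quoted failure numbers (my own back-of-the-envelope
# math, assuming independent failures and an exponential/Poisson model).
import math

NODE_FAIL_PER_DAY = 0.001    # ~0.1%/node/day hardware failure rate (quoted above)
POWER_MTBF_DAYS = 60.0       # one power/cooling failure per ~2 months (quoted above)
DIMMS_PER_NODE = 12

def p_fail(rate_per_day: float, days: float) -> float:
    """Probability of at least one failure in `days` under a Poisson model."""
    return 1.0 - math.exp(-rate_per_day * days)

# 32-node, 7-day job: hardware failure on any of the 32 nodes, plus a
# cluster-wide power/cooling event.
nodes, days = 32, 7
p_hw = p_fail(NODE_FAIL_PER_DAY * nodes, days)
p_power = p_fail(1.0 / POWER_MTBF_DAYS, days)
p_total = 1.0 - (1.0 - p_hw) * (1.0 - p_power)
print(f"hardware only : {p_hw:.0%}")     # ~20%
print(f"power/cooling : {p_power:.0%}")  # ~11%
print(f"combined      : {p_total:.0%}")  # ~29%; a linear sum of the rates gives
                                         # ~34%, close to the quoted ~35%

# Per-DIMM failure rate implied by ~1 node failure per day, mostly DIMMs:
print(f"per-DIMM/day  : {NODE_FAIL_PER_DAY / DIMMS_PER_NODE:.1e}")  # ~8e-05, i.e. ~0.0001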
> If it turns out you see power failure problems, then a simple, low-cost,
> ride-through power stabilizer might be a good solution. Flywheels and
> capacitor-based systems can provide support for momentary power quality
> issues at reasonably low cost for a cluster of your size.
> I doubt there is anything low-cost for a 330 kW system, and in any case, a
> hardware upgrade is not an option since this is a mid-life cluster. Again, as
> I said, the filesystem (2 x 500 TB Lustre partitions) and the management
> nodes are on UPS, but there is no way to put the compute nodes on UPS.
> If your node hardware is the problem, or you decide you do want/need to
> pursue an FT solution, then you might look at the OMPI-based solutions from
> parties such as http://fault-tolerance.org or the MPICH2 folks.
> Thanks for the tip.
> Best regards,
Assistant Professor of Computer Science
University of Wisconsin-La Crosse