Le 2013-01-29 21:02, Ralph Castain a
Thanks for the calculation. However, this is a cluster that I
manage, I do not use it per say, and running such statistical jobs
on a large part of the cluster for a long period of time is
impossible. We do have the numbers however. The cluster has 960
nodes. We experience roughly one power or cooling failure per month
or two months. Assuming one such failure per two months, if you run
for 1 month, you have a 50% chance your job will be killed before it
ends. If you run for 2 weeks, 25%, etc. These are very rough
estimates obviously, but it is way more than 5%.
While our filesystem and management nodes are
on UPS, our compute nodes are not. With one average generic
(power/cooling mostly) failure every one or two months,
running for weeks is just asking for trouble. If you add to
that typical dimm/cpu/networking failures (I estimated about
1 node goes down per day because of some sort hardware
failure, for a cluster of 960 nodes). With these numbers, a
job running on 32 nodes for 7 days has a ~35% chance of
failing before it is done.
I've been running this in my head all day - it just doesn't
fit experience, which really bothered me. So I spent a little
time running the calculation, and I came up with a number much
lower (more like around 5%). I'm not saying my rough number is
correct, but it is at least a little closer to what we see in
Given that there are a lot of assumptions required when doing
these calculations, I would like to suggest you conduct a very
simply and quick experiment before investing tons of time on FT
solutions. All you have to do is:
In addition to that, we have a failure rate of ~0.1%/day, meaning
that out of 960, on average, one node will have a hardware failure
every day. Most of the time, this is a failure of one of the dimms.
Considering each node has 12 dimms of 2GB of memory, it means a dimm
failure rate of ~0.0001 per day. I don't know if that's bad or not,
but this is roughly what we have.
I doubt there is anything low cost for a 330 kW system, and in any
case, hardware upgrade is not an option since this a mid-life
cluster. Again, as I said, the filesystem (2 x 500 TB lustre
partitions) and the management nodes are on UPS, but there is no way
to put the compute nodes on UPS.
If it turns out you see power failure problems, then a
simple, low-cost, ride-thru power stabilizer might be a good
solution. Flywheels and capacitor-based systems can provide
support for momentary power quality issues at reasonably low
costs for a cluster of your size.
Thanks for the tip.
If your node hardware is the problem, or you decide you do
want/need to pursue an FT solution, then you might look at the
OMPI-based solutions from parties such as http://fault-tolerance.org
or the MPICH2 folks.