On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault <maxime.boissonneault@calculquebec.ca> wrote:

While our filesystem and management nodes are on UPS, our compute nodes are not. With one average generic (power/cooling mostly) failure every one or two months, running for weeks is just asking for trouble. If you add to that typical dimm/cpu/networking failures (I estimated about 1 node goes down per day because of some sort hardware failure, for a cluster of 960 nodes). With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of failing before it is done.

I've been running this in my head all day - it just doesn't fit experience, which really bothered me. So I spent a little time running the calculation, and I came up with a number much lower (more like around 5%). I'm not saying my rough number is correct, but it is at least a little closer to what we see in the field.

Given that there are a lot of assumptions required when doing these calculations, I would like to suggest you conduct a very simply and quick experiment before investing tons of time on FT solutions. All you have to do is:

1. run your application on your existing cluster using your favorite non-FT version of OMPI (I believe you were using 1.6.3). If you like, you can run multiple copies in parallel using different sets of nodes.

2. any time you have a failure, add it to your count and record the reason (power failure vs node hardware vs software).

If you do that for two months, you'll build a very accurate assessment of the true failure rate on your cluster. My expectation is that you might see 0-2 failures over 32 runs (4 running in parallel each week) during that entire period, and learn that FT really isn't an issue for your installation.

If it turns out you see power failure problems, then a simple, low-cost, ride-thru power stabilizer might be a good solution. Flywheels and capacitor-based systems can provide support for momentary power quality issues at reasonably low costs for a cluster of your size.

If your node hardware is the problem, or you decide you do want/need to pursue an FT solution, then you might look at the OMPI-based solutions from parties such as http://fault-tolerance.org or the MPICH2 folks.

HTH
Ralph