Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] Checkpointing an MPI application with OMPI
From: Maxime Boissonneault (maxime.boissonneault_at_[hidden])
Date: 2013-01-28 15:46:42

Hi George,
The problem here is not the bandwidth, but the number of IOPs. I wrote
to the BLCR list, and they confirmed that :
"While ideally the checkpoint would be written in sizable chunks, the
current code in BLCR will issue a single write operation for each
contiguous range of user memory, and many quite small writes for various
meta-data and non-memory state (registers, signal-handlers,etc). As
show in Table 1 of the paper cited above, the writes in the 10s of KB
range will dominate performance."

(Reference being : X. Ouyang, R. Rajachandrasekhar, X. Besseron, H.
Wang, J. Huang and D. K. Panda, CRFS: A Lightweight User-Level
Filesystem for Generic Checkpoint/Restart, Int'l Conference on Parallel
Processing (ICPP '11), Sept. 2011. (PDF

We did run parallel IO benchmarks. Our filesystem can reach a speed of
~15GB/s, but only with large IO operations (at least bigger than 1MB,
optimally in the 100MB-1GB range). For small (<1MB) operations, the
filesystem is considerably slower. I believe this is precisely what is
killing the performance here.

Not sure there is anything to be done about it.

Best regards,


Le 2013-01-28 15:40, George Bosilca a écrit :
> At the scale you address you should have no trouble with the C/R if
> the file system is correctly configured. We get more bandwidth per
> node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
> benchmarks on your cluster ?
> George.
> PS: You can some MPI I/O benchmarks at
> On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault <maxime.boissonneault_at_[hidden]> wrote:
>>> Le 2013-01-28 13:15, Ralph Castain a écrit :
>>>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault <maxime.boissonneault_at_[hidden]> wrote:
>>>>> Le 2013-01-28 12:46, Ralph Castain a écrit :
>>>>>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault <maxime.boissonneault_at_[hidden]> wrote:
>>>>>>> Hello Ralph,
>>>>>>> I agree that ideally, someone would implement checkpointing in the application itself, but that is not always possible (commercial applications, use of complicated libraries, algorithms with no clear progression points at which you can interrupt the algorithm and start it back from there).
>>>>>> Hmmm...well, most apps can be adjusted to support it - we have some very complex apps that were updated that way. Commercial apps are another story, but we frankly don't find much call for checkpointing those as they typically just don't run long enough - especially if you are only running 256 ranks, so your cluster is small. Failure rates just don't justify it in such cases, in our experience.
>>>>>> Is there some particular reason why you feel you need checkpointing?
>>>>> This specific case is that the jobs run for days. The risk of a hardware or power failure for that kind of duration goes too high (we allow for no more than 48 hours of run time).
>>>> I'm surprised by that - we run with UPS support on the clusters, but for a small one like you describe, we find the probability that a job will be interrupted even during a multi-week app is vanishingly small.
>>>> FWIW: I do work with the financial industry where we regularly run apps that execute non-stop for about a month with no reported failures. Are you actually seeing failures, or are you anticipating them?
>>> While our filesystem and management nodes are on UPS, our compute nodes are not. With one average generic (power/cooling mostly) failure every one or two months, running for weeks is just asking for trouble.
>> Wow, that is high
>>> If you add to that typical dimm/cpu/networking failures (I estimated about 1 node goes down per day because of some sort hardware failure, for a cluster of 960 nodes).
>> That is incredibly high
>>> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of failing before it is done.
>> I've never seen anything like that behavior in practice - a 32 node cluster typically runs for quite a few months without a failure. It is a typical size for the financial sector, so we have a LOT of experience with such clusters.
>> I suspect you won't see anything like that behavior...
>>> Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of the ram, that's merely 640 GB of data. Writing that on a lustre filesystem capable of reaching ~15GB/s should take no more than a few minutes if written correctly. Right now, I am getting a few minutes for a hundredth of this amount of data!
>> Agreed - but I don't think you'll get that bandwidth for checkpointing. I suspect you'll find that checkpointing really has troubles when scaling, which is why you don't see it used in production (at least, I haven't). Mostly used for research by a handful of organizations, which is why we haven't been too concerned about its loss.
>>>>> While it is true we can dig through the code of the library to make it checkpoint, BLCR checkpointing just seemed easier.
>>>> I see - just be aware that checkpoint support in OMPI will disappear in v1.7 and there is no clear timetable for restoring it.
>>> That is very good to know. Thanks for the information. It is too bad though.
>>>>>>> There certainly must be a better way to write the information down on disc though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering involved ?
>>>>>> I don't know - that's all done in BLCR, I believe. Either way, it isn't something we can address due to the loss of our supporter for c/r.
>>>>> I suppose I should contact BLCR instead then.
>>>> For the disk op problem, I think that's the way to go - though like I said, I could be wrong and the disk writes could be something we do inside OMPI. I'm not familiar enough with the c/r code to state it with certainty.
>>>>> Thank you,
>>>>> Maxime
>>>>>> Sorry we can't be of more help :-(
>>>>>> Ralph
>>>>>>> Thanks,
>>>>>>> Maxime
>>>>>>> Le 2013-01-28 10:58, Ralph Castain a écrit :
>>>>>>>> Our c/r person has moved on to a different career path, so we may not have anyone who can answer this question.
>>>>>>>> What we can say is that checkpointing at any significant scale will always be a losing proposition. It just takes too long and hammers the file system. People have been working on extending the capability with things like "burst buffers" (basically putting an SSD in front of the file system to absorb the checkpoint surge), but that hasn't become very common yet.
>>>>>>>> Frankly, what people have found to be the "best" solution is for your app to periodically write out its intermediate results, and then take a flag that indicates "read prior results" when it starts. This minimizes the amount of data being written to the disk. If done correctly, you would only lose whatever work was done since the last intermediate result was written - which is about equivalent to losing whatever works was done since the last checkpoint.
>>>>>>>> HTH
>>>>>>>> Ralph
>>>>>>>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault <maxime.boissonneault_at_[hidden]> wrote:
>>>>>>>>> Hello,
>>>>>>>>> I am doing checkpointing tests (with BLCR) with an MPI application compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.
>>>>>>>>> First, some details about the tests :
>>>>>>>>> - The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre shared filesystem (tested to be able to provide ~15GB/s for writing and support ~40k IOPs).
>>>>>>>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 nodes). Each MPI rank was using approximately 200MB of memory.
>>>>>>>>> - I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart.
>>>>>>>>> - I was starting with mpirun -am ft-enable-cr
>>>>>>>>> - The nodes are monitored by ganglia, which allows me to see the number of IOPs and the read/write speed on the filesystem.
>>>>>>>>> I tried a few different mca settings, but I consistently observed that :
>>>>>>>>> - The checkpoints lasted ~4-5 minutes each time
>>>>>>>>> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at ~15MB/s.
>>>>>>>>> I am worried by the number of IOPs and the very slow writing speed. This was a very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 GB of ram per rank. With such jobs, the overall number of IOPs would reach tens of thousands and would completely overload our lustre filesystem. Moreover, with 15MB/s per node, the checkpointing process would take hours.
>>>>>>>>> How can I improve on that ? Is there an MCA setting that I am missing ?
>>>>>>>>> Thanks,
>>>>>>>>> --
>>>>>>>>> ---------------------------------
>>>>>>>>> Maxime Boissonneault
>>>>>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>>>>>> Ph. D. en physique
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>> --
>>>>>>> ---------------------------------
>>>>>>> Maxime Boissonneault
>>>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>>>> Ph. D. en physique
>>>>> --
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>> Ph. D. en physique
>>> --
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Analyste de calcul - Calcul Québec, Université Laval
>>> Ph. D. en physique
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]

Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique