Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing an MPI application with OMPI
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-01-28 16:00:48


Based on the paper you linked, the answer is quite obvious. The
proposed CRFS mechanism supports all of the checkpoint-enabled MPI
implementations, so you just have to go with the one that provides and
cares about the services you need.

  George.

On Mon, Jan 28, 2013 at 3:46 PM, Maxime Boissonneault
<maxime.boissonneault_at_[hidden]> wrote:
> Hi George,
> The problem here is not the bandwidth, but the number of IOPs. I wrote to
> the BLCR list, and they confirmed that:
> "While ideally the checkpoint would be written in sizable chunks, the
> current code in BLCR will issue a single write operation for each contiguous
> range of user memory, and many quite small writes for various meta-data and
> non-memory state (registers, signal handlers, etc.). As shown in Table 1 of
> the paper cited above, the writes in the 10s of KB range will dominate
> performance."
>
> (Reference being : X. Ouyang, R. Rajachandrasekhar, X. Besseron, H. Wang, J.
> Huang and D. K. Panda, CRFS: A Lightweight User-Level Filesystem for Generic
> Checkpoint/Restart, Int'l Conference on Parallel Processing (ICPP '11),
> Sept. 2011. (PDF))
>
> We did run parallel IO benchmarks. Our filesystem can reach a speed of
> ~15GB/s, but only with large IO operations (at least bigger than 1MB,
> optimally in the 100MB-1GB range). For small (<1MB) operations, the
> filesystem is considerably slower. I believe this is precisely what is
> killing the performance here.
>
> Not sure there is anything to be done about it.
>
> Best regards,
>
>
> Maxime
>
> Le 2013-01-28 15:40, George Bosilca a écrit :
>
> At the scale you are targeting, you should have no trouble with the C/R if
> the file system is correctly configured. We get more bandwidth per
> node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
> benchmarks on your cluster?
>
> George.
>
> PS: You can find some MPI I/O benchmarks at
> http://www.mcs.anl.gov/~thakur/pio-benchmarks.html
>
>
>
> On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault
> <maxime.boissonneault_at_[hidden]> wrote:
>
> Le 2013-01-28 13:15, Ralph Castain a écrit :
>
> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault
> <maxime.boissonneault_at_[hidden]> wrote:
>
> Le 2013-01-28 12:46, Ralph Castain a écrit :
>
> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
> <maxime.boissonneault_at_[hidden]> wrote:
>
> Hello Ralph,
> I agree that ideally, someone would implement checkpointing in the
> application itself, but that is not always possible (commercial
> applications, use of complicated libraries, algorithms with no clear
> progression points at which you can interrupt the algorithm and restart it
> from there).
>
> Hmmm...well, most apps can be adjusted to support it - we have some very
> complex apps that were updated that way. Commercial apps are another story,
> but we frankly don't find much call for checkpointing those as they
> typically just don't run long enough - especially if you are only running
> 256 ranks, so your cluster is small. Failure rates just don't justify it in
> such cases, in our experience.
>
> Is there some particular reason why you feel you need checkpointing?
>
> In this specific case, the jobs run for days. The risk of a hardware or
> power failure over that kind of duration becomes too high (we allow no more
> than 48 hours of run time).
>
> I'm surprised by that - we run with UPS support on the clusters, but for a
> small one like you describe, we find the probability that a job will be
> interrupted even during a multi-week app is vanishingly small.
>
> FWIW: I do work with the financial industry where we regularly run apps that
> execute non-stop for about a month with no reported failures. Are you
> actually seeing failures, or are you anticipating them?
>
> While our filesystem and management nodes are on UPS, our compute nodes are
> not. With, on average, one generic failure (mostly power/cooling) every one or
> two months, running for weeks is just asking for trouble.
>
> Wow, that is high
>
> Add to that the typical DIMM/CPU/networking failures (I estimate that about 1
> node goes down per day because of some sort of hardware failure, for a cluster
> of 960 nodes).
>
> That is incredibly high
>
> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance
> of failing before it is done.
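
The ~35% estimate can be reproduced approximately with a simple model. The sketch below assumes failures are independent and arrive at a constant rate (a Poisson model; this is an assumption, not something stated in the thread). It combines the two rates quoted above: about one node failure per day across 960 nodes, and one site-wide power/cooling failure every one or two months. The function and parameter names are illustrative only.

```python
import math

def job_failure_probability(job_nodes, job_days, cluster_nodes,
                            node_failures_per_day, days_between_site_failures):
    """Chance that at least one failure hits the job before it finishes.

    Assumes independent, constant-rate (Poisson) failures.
    """
    # Expected number of hardware failures that land on the job's own nodes.
    hardware = node_failures_per_day * (job_nodes / cluster_nodes) * job_days
    # Expected number of site-wide (power/cooling) failures during the job.
    site = job_days / days_between_site_failures
    # Probability of at least one event for a Poisson process.
    return 1.0 - math.exp(-(hardware + site))

# 32 nodes for 7 days on a 960-node cluster losing about 1 node per day.
for interval_days in (30, 60):
    p = job_failure_probability(32, 7, 960, 1.0, interval_days)
    print(f"site failure every {interval_days} days: {p:.0%}")
```

Under these assumptions the result falls at roughly 30% to 37%, depending on whether the site-wide failures come every two months or every month, which is in line with the ~35% quoted above.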
>
> I've never seen anything like that behavior in practice - a 32 node cluster
> typically runs for quite a few months without a failure. It is a typical
> size for the financial sector, so we have a LOT of experience with such
> clusters.
>
> I suspect you won't see anything like that behavior...
>
> With 24GB of RAM per node, even if a 32-node job takes close to 100% of
> the RAM, that's merely 768 GB of data. Writing that to a Lustre filesystem
> capable of reaching ~15GB/s should take no more than a few minutes if
> written correctly. Right now, I am getting a few minutes for a hundredth of
> this amount of data!
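
The gap between the two figures is easy to quantify. The sketch below is a back-of-the-envelope calculation only: it ignores latency and metadata traffic, and it assumes that the ~15MB/s per node observed in the small test would stay the same for a full-memory checkpoint, with all nodes writing in parallel. The names are illustrative.

```python
def transfer_time_seconds(data_gb, bandwidth_gb_per_s):
    """Time to move data_gb at a sustained bandwidth (no latency, no metadata)."""
    return data_gb / bandwidth_gb_per_s

NODES = 32
RAM_PER_NODE_GB = 24
total_gb = NODES * RAM_PER_NODE_GB  # upper bound on the checkpoint size

# Whole job written at the filesystem's peak aggregate rate (~15 GB/s).
ideal_s = transfer_time_seconds(total_gb, 15.0)

# One node's share written at the per-node rate seen in the test (~15 MB/s).
observed_s = transfer_time_seconds(RAM_PER_NODE_GB, 0.015)

print(f"total data:        {total_gb} GB")
print(f"at 15 GB/s total:  {ideal_s:.0f} s")
print(f"at 15 MB/s/node:   {observed_s / 60:.0f} min")
```

At the peak rate the whole job would be written in under a minute; at the per-node rate seen in the test, each node would need close to half an hour.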
>
> Agreed - but I don't think you'll get that bandwidth for checkpointing. I
> suspect you'll find that checkpointing really has trouble scaling,
> which is why you don't see it used in production (at least, I haven't).
> It is mostly used for research by a handful of organizations, which is why we
> haven't been too concerned about its loss.
>
>
> While it is true we can dig through the code of the library to make it
> checkpoint, BLCR checkpointing just seemed easier.
>
> I see - just be aware that checkpoint support in OMPI will disappear in v1.7
> and there is no clear timetable for restoring it.
>
> That is very good to know. Thanks for the information. It is too bad though.
>
> There certainly must be a better way to write the information to disk,
> though. Doing 500 IOPs seems to be completely broken. Why isn't there
> any buffering involved?
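
To illustrate what buffering would change, here is a minimal sketch of write coalescing. It is not BLCR's or Open MPI's code: `CountingSink` and `backend_writes` are hypothetical names, and the sink stands in for the filesystem by counting how many write calls actually reach it. Many small writes go through Python's standard `io.BufferedWriter`, which collects them in memory and passes them on in large chunks.

```python
import io

class CountingSink(io.RawIOBase):
    """Stand-in for a file: discards data but counts the write calls it receives."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)

def backend_writes(chunk_size, n_chunks, buffer_size=None):
    """Issue n_chunks writes of chunk_size bytes; return (calls, bytes) seen by the sink.

    With buffer_size=None every write goes straight to the sink.
    """
    sink = CountingSink()
    writer = sink if buffer_size is None else io.BufferedWriter(sink, buffer_size=buffer_size)
    chunk = bytes(chunk_size)
    for _ in range(n_chunks):
        writer.write(chunk)
    writer.flush()
    return sink.calls, sink.nbytes

# 1024 writes of 16 KiB each, first unbuffered, then through a 4 MiB buffer.
direct = backend_writes(16 * 1024, 1024)
buffered = backend_writes(16 * 1024, 1024, buffer_size=4 * 1024 * 1024)
print("unbuffered write calls:", direct[0])
print("buffered write calls:  ", buffered[0])
```

The same number of bytes reaches the sink in both cases, but the buffered path needs far fewer operations. On a filesystem that only performs well with writes above 1MB, that difference is what matters.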
>
> I don't know - that's all done in BLCR, I believe. Either way, it isn't
> something we can address due to the loss of our supporter for c/r.
>
> I suppose I should contact BLCR instead then.
>
> For the disk op problem, I think that's the way to go - though like I said,
> I could be wrong and the disk writes could be something we do inside OMPI.
> I'm not familiar enough with the c/r code to state it with certainty.
>
> Thank you,
>
> Maxime
>
> Sorry we can't be of more help :-(
> Ralph
>
> Thanks,
>
> Maxime
>
>
> Le 2013-01-28 10:58, Ralph Castain a écrit :
>
> Our c/r person has moved on to a different career path, so we may not have
> anyone who can answer this question.
>
> What we can say is that checkpointing at any significant scale will always
> be a losing proposition. It just takes too long and hammers the file system.
> People have been working on extending the capability with things like "burst
> buffers" (basically putting an SSD in front of the file system to absorb the
> checkpoint surge), but that hasn't become very common yet.
>
> Frankly, what people have found to be the "best" solution is for your app to
> periodically write out its intermediate results, and then take a flag that
> indicates "read prior results" when it starts. This minimizes the amount of
> data being written to the disk. If done correctly, you would only lose
> whatever work was done since the last intermediate result was written -
> which is about equivalent to losing whatever work was done since the last
> checkpoint.
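
The approach described above can be sketched as follows. This is a single-process illustration under stated assumptions, not code from any application in the thread: the state is a small dictionary, the "work" is a placeholder, and `save_state`, `load_state` and `run` are hypothetical names. In an MPI job each rank would typically write its own file, which is not shown here.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Write the state atomically: write a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash before this point leaves the previous result file untouched.
    os.replace(tmp_path, path)

def load_state(path):
    with open(path) as f:
        return json.load(f)

def run(total_steps, save_every, path, restart=False):
    """Run the computation, saving intermediate results every save_every steps.

    With restart=True, resume from the last saved result if one exists.
    """
    if restart and os.path.exists(path):
        state = load_state(path)
    else:
        state = {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        state["value"] += 1.0  # placeholder for one step of real work
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_state(path, state)
    return state
```

Calling `run(..., restart=True)` plays the role of the "read prior results" flag. Only the data needed to resume is written, rather than the full memory image of every process, and at most `save_every` steps are lost on a failure.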
>
> HTH
> Ralph
>
> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
> <maxime.boissonneault_at_[hidden]> wrote:
>
> Hello,
> I am doing checkpointing tests (with BLCR) with an MPI application compiled
> with Open MPI 1.6.3, and I am seeing some quite strange behavior.
>
> First, some details about the tests:
> - The only filesystems available on the nodes are 1) one tmpfs, 2) one Lustre
> shared filesystem (tested to be able to provide ~15GB/s for writing and
> support ~40k IOPs).
> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2
> nodes). Each MPI rank was using approximately 200MB of memory.
> - I was doing checkpoints with ompi-checkpoint and restarting with
> ompi-restart.
> - I was starting with mpirun -am ft-enable-cr
> - The nodes are monitored by ganglia, which allows me to see the number of
> IOPs and the read/write speed on the filesystem.
>
> I tried a few different MCA settings, but I consistently observed that:
> - The checkpoints lasted ~4-5 minutes each time
> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at
> ~15MB/s.
>
> I am worried by the number of IOPs and the very slow writing speed. This was
> a very small test. We have jobs running with 128 or 256 MPI ranks, each
> using 1-2 GB of ram per rank. With such jobs, the overall number of IOPs
> would reach tens of thousands and would completely overload our lustre
> filesystem. Moreover, with 15MB/s per node, the checkpointing process would
> take hours.
>
> How can I improve on that? Is there an MCA setting that I am missing?
>
> Thanks,
>
> --
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users