I am doing checkpointing tests (with BLCR) on an MPI application
compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite
surprising.

First, some details about the tests:
- The only filesystems available on the nodes are 1) a tmpfs and 2) a
Lustre shared filesystem (benchmarked at ~15 GB/s for writes and ~40k
IOPS).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or
2 nodes). Each MPI rank was using approximately 200 MB of memory.
- I was taking checkpoints with ompi-checkpoint and restarting with
ompi-restart.
- I was launching the job with mpirun -am ft-enable-cr.
- The nodes are monitored by ganglia, which allows me to see the number
of IOPs and the read/write speed on the filesystem.
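To be concrete, the command sequence was essentially the following (the
application name and rank count are illustrative placeholders):

```shell
# Launch the job with the checkpoint/restart AMCA profile
mpirun -am ft-enable-cr -np 16 ./my_app    # ./my_app is a placeholder

# From another shell, request a checkpoint by passing mpirun's PID
ompi-checkpoint <PID_of_mpirun>

# Later, restart from the global snapshot that was written
ompi-restart ompi_global_snapshot_<PID>.ckpt
```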
I tried a few different MCA settings, but I consistently observed that:
- Each checkpoint took ~4-5 minutes.
- During a checkpoint, each node (8 ranks) was doing ~500 IOPS and
writing at ~15 MB/s.
I am worried by the number of IOPS and the very slow write speed. This
was a very small test. We have jobs running with 128 or 256 MPI ranks,
each using 1-2 GB of RAM per rank. With such jobs, the overall number of
IOPS would reach tens of thousands and would completely overload our
Lustre filesystem. Moreover, at 15 MB/s per node, the checkpointing
process would take hours.
How can I improve on that? Is there an MCA setting that I am missing?
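For context, the kind of settings I have been experimenting with are the
ones that control where snapshots are written. A sketch of what I have
in mind, if my understanding of these MCA parameters is correct (the
directory paths are placeholders for our site):

```shell
# Hypothetical example: write each rank's local BLCR snapshot to tmpfs
# first, and point the global snapshot directory at Lustre, so the
# many small writes hit local memory rather than the shared filesystem.
# (Parameter names taken from the Open MPI C/R documentation.)
mpirun -am ft-enable-cr \
    -mca crs_base_snapshot_dir /dev/shm/ckpt \
    -mca snapc_base_global_snapshot_dir /lustre/ckpt \
    -np 16 ./my_app    # ./my_app is a placeholder
```

I am not sure whether this is the intended way to stage checkpoints, or
whether it would actually change the I/O pattern I am seeing.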
Computational analyst - Calcul Québec, Université Laval
Ph.D. in Physics