
Subject: [OMPI users] Checkpointing an MPI application with OMPI
From: Maxime Boissonneault (maxime.boissonneault_at_[hidden])
Date: 2013-01-28 10:47:11


Hello,
I am doing checkpointing tests (with BLCR) on an MPI application
compiled with Open MPI 1.6.3, and I am seeing some rather strange
behavior.

First, some details about the tests:
- The only filesystems available on the nodes are 1) a tmpfs and 2) a
shared Lustre filesystem (measured to sustain ~15 GB/s writes and
~40k IOPS).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or
2 nodes). Each MPI rank was using approximately 200 MB of memory.
- I was checkpointing with ompi-checkpoint and restarting with
ompi-restart; the job was started with mpirun -am ft-enable-cr (the
exact sequence is sketched below).
- The nodes are monitored by Ganglia, which lets me see the number of
IOPS and the read/write rates on the filesystem.
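
Concretely, the sequence was along these lines (the application name,
rank count and PID are placeholders):

  mpirun -np 16 -am ft-enable-cr ./my_app
  # from another shell, giving the PID of mpirun:
  ompi-checkpoint <mpirun_pid>
  # later, restart from the resulting global snapshot:
  ompi-restart ompi_global_snapshot_<mpirun_pid>.ckpt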

I tried a few different MCA settings (see the sketch after this list),
but I consistently observed that:
- The checkpoints lasted ~4-5 minutes each time
- During a checkpoint, each node (8 ranks) was doing ~500 IOPS and
writing at ~15 MB/s.
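
For completeness, the variations were of this general form (I am
quoting the snapshot-directory parameter name from memory, and the path
and application name are placeholders, so treat this as a sketch rather
than the exact command):

  mpirun -np 16 -am ft-enable-cr \
         -mca snapc_base_global_snapshot_dir /lustre/ckpt \
         ./my_app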

I am worried by the number of IOPS and the very slow write speed. This
was a very small test. We have jobs running with 128 or 256 MPI ranks,
each using 1-2 GB of RAM per rank. With such jobs, the overall number of
IOPS would reach tens of thousands and would completely overload our
Lustre filesystem. Moreover, at ~15 MB/s per node, the checkpointing
process would take hours.
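
Back of the envelope, assuming the per-node behavior from the small
test scales: 8 ranks x 2 GB = 16 GB per node, which at ~15 MB/s is
already ~18 minutes per checkpoint in the best case, before 16-32 nodes
start contending for Lustre at once.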

How can I improve on that? Is there an MCA setting that I am missing?

Thanks,

-- 
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in physics