I asked this question because checkpointing to NFS succeeds, but checkpointing without a mounted filesystem or shared storage produces this warning and error:
WARNING: Could not preload specified file: File already exists.
Will continue attempting to launch the process.
filem:rsh: wait_all(): Wait failed (-1)
[[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
This happens even if I set the MCA parameter like this:
snapc_base_store_in_place=0
and even though the nodes can connect to each other through ssh without a password.
Let's say I have an MPI application running on several hosts. Is there any way to checkpoint this application without shared storage between the nodes?
I already took a look at the examples at http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that both cases use a globally mounted file system.