
Open MPI User's Mailing List Archives


Subject: [OMPI users] openmpi self checkpointing - error while running example
From: Hellmüller Roman (hroman_at_[hidden])
Date: 2011-04-06 06:05:45


Hi

I'm trying to get fault-tolerant Open MPI running on our cluster for my semester thesis.

The build and compile were successful, and BLCR checkpointing works (Open MPI 1.5.3, BLCR 0.8.2).

Now I'm trying to set up SELF checkpointing. The example from http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work: I can run the application and take checkpoints, but restarting fails. I got the following error by doing as suggested:

mpicc my-app.c -export -export-dynamic -o my-app

mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman_at_cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

I also tried setting the path in the example file (the restart_path variable), changing the checkpoint directories, and running the application in different directories.
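For context, here is a minimal sketch of the callback layout that the `-mca crs_self_prefix my_personal` setting implies. It assumes the SELF component looks up three functions named `<prefix>_checkpoint`, `<prefix>_continue` and `<prefix>_restart`, with the signatures used on the examples page. The function bodies and the value of `restart_path` are hypothetical placeholders, not the contents of the actual example file.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical value: the command that should be run on restart. */
static const char *restart_path = "my-app";

/* Assumed to be called when a checkpoint is requested. The callback
 * hands a heap-allocated restart command back through restart_cmd. */
int my_personal_checkpoint(char **restart_cmd)
{
    size_t len = strlen(restart_path) + 1;

    *restart_cmd = malloc(len);
    if (*restart_cmd == NULL) {
        return -1;
    }
    memcpy(*restart_cmd, restart_path, len);

    /* A real application would save its own state to a file here. */
    printf("my_personal_checkpoint: restart command is \"%s\"\n", *restart_cmd);
    return 0;
}

/* Assumed to be called in the original process after the checkpoint. */
int my_personal_continue(void)
{
    printf("my_personal_continue: resuming normal execution\n");
    return 0;
}

/* Assumed to be called in the restarted process. */
int my_personal_restart(void)
{
    /* A real application would reload its saved state here. */
    printf("my_personal_restart: restarted from checkpoint\n");
    return 0;
}
```

These functions would be compiled into my-app.c itself. If the assumption about the naming scheme holds, the restart command that ompi-restart reports as missing would be the string returned through `restart_cmd`.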

Do you have an idea where the error could be?

At http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB) you'll find the library and the build of Open MPI and BLCR, as well as the environment variables and the output of ompi_info. There is one set for the login node and another for the compute nodes, because they run different kernels. At http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz there is the checkpoint that was produced. Please let me know if more output is needed.

cheers
roman