Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi self checkpointing - error while running example
From: Hellmüller Roman (hroman_at_[hidden])
Date: 2011-04-06 09:30:43


Hi Toan

No, that didn't change anything. I'm trying to restart the program on the same computer it ran on before, and I execute ompi-restart on that machine as well.

machinefile_cbl1 contains just cbl1

hroman_at_cbl1 ~/checkpoints $ ompi-restart -v -machinefile machinefile_cbl1 ompi_global_snapshot_28952.ckpt/
[cbl1:30308] Checking for the existence of (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:30308] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:30308] Exec in self
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

cheers
roman

________________________________
From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] on behalf of Nguyen Toan [nguyentoan1508_at_[hidden]]
Sent: Wednesday, 6 April 2011 15:00
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running example

Hi Roman,

It seems that you misunderstand the parameter "-machinefile".
This parameter should be followed by a file containing a list of the machines
on which your MPI application will run. For example, if you want to
run your app on 2 nodes named "node1" and "node2", then this file, let's call it "MACHINES_FILE", should look like this:

node1
node2

Now try to checkpoint and restart again with "-machinefile MACHINES_FILE". Hope it works.
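
The machine file described above is a plain text file with one host name per line. A minimal sketch of creating it follows; "node1" and "node2" are the example names from this message, not real hosts, and the snapshot directory on the last line is a placeholder. The restart command is only printed, not executed.

```shell
# Write one host name per line (example names; replace with your real nodes).
printf '%s\n' node1 node2 > MACHINES_FILE

# Show the resulting file.
cat MACHINES_FILE

# Print (do not run) the restart command that would use this file.
# <global-snapshot-directory> is a placeholder for your own snapshot.
echo "ompi-restart -machinefile MACHINES_FILE <global-snapshot-directory>"
```

Note that the file passed to "-machinefile" lists hosts only; it is a separate file from the checkpoint snapshot itself.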

On Wed, Apr 6, 2011 at 9:13 PM, Hellmüller Roman <hroman_at_[hidden]> wrote:
Hi Toan

Thanks for your suggestion. It gives me the following result, which does not tell me anything more.

hroman_at_cbl1 ~/checkpoints $ ompi-restart -v -machinefile ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt ompi_global_snapshot_28952.ckpt/
[cbl1:28974] Checking for the existence of (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:28974] Exec in self
ssh: connect to host 15 port 22: Invalid argument
--------------------------------------------------------------------------
A daemon (pid 28975) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
hroman_at_cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
/cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64

The library path seems to be OK, or should it look different? Do you have another idea?
cheers
roman

________________________________
From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] on behalf of Nguyen Toan [nguyentoan1508_at_[hidden]]
Sent: Wednesday, 6 April 2011 13:20
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running example

Hi Roman,

Did you try to checkpoint and restart with the parameter "-machinefile"? It may work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hroman_at_[hidden]> wrote:
Hi

I'm trying to get fault-tolerant Open MPI running on our cluster for my semester thesis.

Build and compile were successful, and BLCR checkpointing works (Open MPI 1.5.3, BLCR 0.8.2).

Now I'm trying to set up SELF checkpointing. The example from http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the application and also take checkpoints, but restarting does not work. I got the following error by doing as suggested:

mpicc my-app.c -export -export-dynamic -o my-app

mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman_at_cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
     checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
     checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
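
For reference, the sequence above can be sketched as a dry run that only prints the commands. The compile, run, and restart lines are copied from this message; the ompi-checkpoint step is not shown in the message and is added here on the assumption that the checkpoint was taken by passing the PID of the running mpirun. MPIRUN_PID and the helper names show and workflow are placeholders introduced for this sketch.

```shell
# Dry-run sketch: each step is printed, nothing is executed.
# MPIRUN_PID is a placeholder for the PID of the running mpirun process.
MPIRUN_PID=${MPIRUN_PID:-27167}

# Helper (hypothetical name) that prints a command line.
show() { printf '%s\n' "$*"; }

workflow() {
    # 1. Compile, as given in the message.
    show "mpicc my-app.c -export -export-dynamic -o my-app"
    # 2. Run with checkpoint/restart enabled and the SELF callback prefix.
    show "mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app"
    # 3. Checkpoint from a second terminal (assumed step, not in the message).
    show "ompi-checkpoint ${MPIRUN_PID}"
    # 4. Restart from the global snapshot named after that PID.
    show "ompi-restart ompi_global_snapshot_${MPIRUN_PID}.ckpt"
}

workflow
```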

I also experimented with setting the path in the example file (the restart_path variable), changing the checkpoint directories, and running the application in different directories...

Do you have an idea where the error could be?

Here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB) you'll find the library and the build of Open MPI and BLCR, as well as the environment variables and the output of ompi_info. There is one build for the login node and another for the compute nodes, because they run different kernels. And here http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz is the checkpoint that was produced. Please let me know if more output is needed.

cheers
roman

_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users