Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi self checkpointing - error while running example
From: Nguyen Toan (nguyentoan1508_at_[hidden])
Date: 2011-04-06 09:00:45


Hi Roman,

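It seems that you misunderstood the parameter "-machinefile".
This parameter should be followed by a file containing a list of the
machines on which your MPI application will run. In your command it is
followed by my-hroman-cr-file.ckpt, so the contents of that file were
presumably read as host names, which would explain the
"ssh: connect to host 15" error. For example, if you want to run your app
on 2 nodes named "node1" and "node2", then this file, let's call it
"MACHINES_FILE", should look like this: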

node1
node2

Now try to checkpoint and restart again with "-machinefile MACHINES_FILE".
Hope it works.
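
As a rough sketch of the steps above: the host names node1 and node2 are the placeholders from this message, and the application name and snapshot name are taken from your earlier output, so substitute your own values. The two Open MPI commands are only printed, not executed, so the snippet can be tried on a machine without Open MPI installed.

```shell
#!/bin/sh
# Write the machine file: one host name per line.
# node1 and node2 are placeholder names, not real hosts.
cat > MACHINES_FILE <<'EOF'
node1
node2
EOF

# Snapshot directory name copied from the quoted session; yours will differ.
SNAPSHOT=ompi_global_snapshot_28952.ckpt

# Print the commands to run by hand. -machinefile is followed by the
# machine file, and the snapshot is passed as a separate last argument.
echo "mpirun -np 2 -machinefile MACHINES_FILE -am ft-enable-cr my-app"
echo "ompi-restart -machinefile MACHINES_FILE $SNAPSHOT"
```

The point is that MACHINES_FILE holds only host names. The checkpoint file does not go after "-machinefile".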

On Wed, Apr 6, 2011 at 9:13 PM, Hellmüller Roman <hroman_at_[hidden]> wrote:

> Hi Toan
>
> Thx for your suggestion. It gives me the following result, which does not
> tell anything more.
>
> hroman_at_cbl1 ~/checkpoints $ ompi-restart -v -machinefile ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt ompi_global_snapshot_28952.ckpt/
> [cbl1:28974] Checking for the existence of
> (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
> [cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
> [cbl1:28974] Exec in self
> ssh: connect to host 15 port 22: Invalid argument
> --------------------------------------------------------------------------
> A daemon (pid 28975) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> hroman_at_cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
>
> /cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64
>
> The library path seems to be ok or should it look different? do you have
> another idea?
> cheers
> roman
>
> ________________________________
> From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] on behalf
> of Nguyen Toan [nguyentoan1508_at_[hidden]]
> Sent: Wednesday, 6 April 2011 13:20
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi self checkpointing - error while running
> example
>
> Hi Roman,
>
> Did you try to checkpoint and restart with the parameter "-machinefile"? It
> may work.
>
> Regards,
> Nguyen Toan
>
> On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hroman_at_[hidden]> wrote:
> Hi
>
> I'm trying to get fault-tolerant ompi running on our cluster for my
> semester thesis.
>
> Build & compile were successful, blcr checkpointing works. openmpi 1.5.3,
> blcr 0.8.2
>
> Now I'm trying to set up the SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and also do checkpoints, but restarting won't work. I
> got the following error by doing as suggested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman_at_cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> I also tried setting the path in the example file (restart_path
> variable), changing the checkpoint directories, and running the application
> in different directories...
>
> Do you have an idea where the error could be?
>
> Here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB)
> you'll find the library and the build of openmpi & blcr as well as the env
> variables and the output of ompi_info. There is one for the login and the
> other for the compute nodes due to different kernels. And here
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
> there is the produced checkpoint. Please let me know if more outputs are
> needed.
>
> cheers
> roman
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>