Open MPI User's Mailing List Archives

Subject: [OMPI users] Fw: Problem with checkpointing multihosts, multiprocesses MPI application
From: Kritiraj Sajadah (ksajadah_at_[hidden])
Date: 2010-01-02 07:09:41


Hi everyone,
Happy New Year 2010. A few weeks ago I posted a query (please see the email below) about checkpointing applications running on multiple hosts. I am still struggling to find a solution and would really appreciate it if someone could help me.

Thank you.

Raj
     
        

--- On Sat, 12/12/09, Kritiraj Sajadah <ksajadah_at_[hidden]> wrote:

> From: Kritiraj Sajadah <ksajadah_at_[hidden]>
> Subject: Problem with checkpointing multihosts, multiprocesses MPI application
> To: users_at_[hidden]
> Date: Saturday, December 12, 2009, 3:03 PM
> Dear All,
> I am trying to checkpoint an MPI application that has two
> processes, each running on a separate host.
>
> I run the application as follows:
>
> raj_at_sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile
> sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir
> /tmp m.
>
> and I trigger the checkpoint as follows:
>
> raj_at_sun32:~$ ompi-checkpoint -v 30010
>
>
> The following happens, displaying two errors while
> checkpointing the application:
>
>
> ##############################################
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processorrrrrrrr no 0 of a total of 2 procs on host
> sun32
> I am processorrrrrrrr no 1 of a total of 2 procs on host
> sun06
>
> [sun32:30010] Error: expected_component: PID information
> unavailable!
> [sun32:30010] Error: expected_component: Component Name
> information unavailable!
>
> I am processssssssssssor no 1 of a total of 2 procs on host
> sun06
> I am processssssssssssor no 0 of a total of 2 procs on host
> sun32
> bye
> bye
> ############################################
>
>
>
>
> When I try to restart the application from the checkpoint
> file, I get the following:
>
> raj_at_sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
>        or provided an invalid filename.
>        Please see --help for usage.
>
> --------------------------------------------------------------------------
> I am processssssssssssor no 0 of a total of 2 procs on host
> sun32
> bye
>
>
> I would very much appreciate it if you could give me some
> ideas on how to checkpoint and restart an MPI application
> running on multiple hosts.
>
> Thank you
>
> Regards,
>
> Raj
>
>
>      
>