Open MPI User's Mailing List Archives


Subject: [OMPI users] Application hangs when checkpointing application (update)
From: Jean Potsam (jeanpotsam_at_[hidden])
Date: 2009-09-14 16:54:26


Hi Josh,
           Thanks for the response. I am actually testing it on a single node (though in the near future I will run it on a set of nodes), so my application is running on the same machine as mpirun.
When I run the application and trigger the checkpointing mechanism from a separate terminal, it checkpoints fine.

However, when I try to checkpoint it from within the main program as shown below, it hangs.

kind regards,

Jean

--- On Mon, 14/9/09, Josh Hursey <jjhursey_at_[hidden]> wrote:

From: Josh Hursey <jjhursey_at_[hidden]>
Subject: Re: [OMPI users] Application hangs when checkpointing application (update)
To: "Open MPI Users" <users_at_[hidden]>
Date: Monday, 14 September, 2009, 1:27 PM

Is your application running on the same machine as mpirun?

How did you configure Open MPI? Note that this program will not work without the FT thread enabled, which would be one reason why it would seem to hang (since the checkpoint request is waiting for the application to enter the MPI library):
  --enable-ft-thread --enable-mpi-threads

I do not think the message that you saw is related. Often orte_checkpoint cannot figure out the jobid on first contact with the HNP/mpirun process, so this is displayed as an INVALID handle.
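
The hang described above is consistent with a blocking call: system() does not return until ompi-checkpoint exits, while the checkpoint request waits for the application to enter the MPI library. As an illustration only (not from this thread, and not a confirmed fix), here is a minimal C sketch of starting a command without blocking the caller and polling for its completion later. The helper names spawn_detached and poll_child are hypothetical, and whether this pattern avoids the hang when the FT thread is not enabled is an assumption that would need testing.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start `cmd` through /bin/sh and return immediately.
 * Returns the child's PID, or -1 if fork() fails. */
pid_t spawn_detached(const char *cmd)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: replace this process with the shell running cmd. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127); /* reached only if exec fails */
    }
    return pid; /* parent: child PID, or -1 on fork failure */
}

/* Hypothetical helper: check on the child without blocking.
 * Returns 1 and stores the exit code once the child has finished,
 * 0 while it is still running, and -1 on error. */
int poll_child(pid_t pid, int *exit_code)
{
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);

    if (r == 0)
        return 0;
    if (r < 0)
        return -1;
    *exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 1;
}
```

In a program like the one quoted below, trigger_checkpoint() could call spawn_detached() with the same ompi-checkpoint command line instead of system(), and the main program could call poll_child() between its later steps. The application would still have to reach an MPI call for the checkpoint to proceed.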

-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:

>
> Hi Everyone,
>               I noticed that it hangs just before displaying the following while trying to checkpoint the application.
>
> ############################
> [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> ###############################
>
> Can it be related to the above?
>
> Thanks
>
>
> ----------------------------------------------------------------------------------------------------------------------
> Hi Everyone,
>                     I wrote a small program with a function to trigger the checkpointing mechanism as follows:
>
> ############################################
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <signal.h>
>
> void trigger_checkpoint(void);
>
> int main(int argc, char **argv)
> {
>   int rank, size;
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>   printf("I am processor no %d of a total of %d procs \n", rank, size);
>   system("sleep 10");
>   /* Request a checkpoint of this job from inside the application. */
>   trigger_checkpoint();
>   printf("I am processor no %d of a total of %d procs \n", rank, size);
>   system("sleep 10");
>   printf("I am processor no %d of a total of %d procs \n", rank, size);
>   system("sleep 10");
>   printf("bye \n");
>   MPI_Finalize();
>   return 0;
> }
>
> /* Runs ompi-checkpoint against the local mpirun process.
>  * system() blocks until the command exits. */
> void trigger_checkpoint(void)
> {
>   printf("hi\n");
>   system("ompi-checkpoint -v `pidof mpirun` ");
> }
> #############################################
>
>
> The application works fine on my laptop, which runs Ubuntu. However, when I tried running it on one of the machines at my uni, which has SUSE Linux installed, the application hangs as soon as ompi-checkpoint is triggered. This is what I get:
>
>
>
> ##########################################################
> I am processor no 0 of a total of 1 procs
> hi
> I am processor no 0 of a total of 1 procs
> [sun06:15426] orte_checkpoint: Checkpointing...
> [sun06:15426]    PID 15411
> [sun06:15426]    Connected to Mpirun [[12727,0],0]
> [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
> ###################################################
>
> Does anyone have any ideas about this?
>
> Thanks a lot
>
> Jean.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
