Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Application hangs when checkpointing application (update)
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-09-17 11:21:54


Interesting. I'll try to take a look and see if I can reproduce today.

-- Josh

On Sep 14, 2009, at 4:54 PM, Jean Potsam wrote:

> Hi Josh,
> Thanks for the response. I am actually testing it on a
> single node (though in the near future I will run it on a set of
> nodes). Therefore, my application is running on the same machine as
> mpirun.
> When I run the application and trigger the checkpointing mechanism
> from a separate terminal, it checkpoints fine.
>
> However, when I try to checkpoint it from within the main program as
> shown below, it hangs.
>
> kind regards,
>
> Jean
>
>
> --- On Mon, 14/9/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
>
> From: Josh Hursey <jjhursey_at_[hidden]>
> Subject: Re: [OMPI users] Application hangs when checkpointing
> application (update)
> To: "Open MPI Users" <users_at_[hidden]>
> Date: Monday, 14 September, 2009, 1:27 PM
>
> Is your application running on the same machine as mpirun?
>
> How did you configure Open MPI? Note that this program will not work
> without the FT thread enabled, which would be one reason why it
> would seem to hang (since it is waiting for the application to enter
> the MPI library):
> --enable-ft-thread --enable-mpi-threads
>
> I do not think the message that you saw is related. Often
> orte_checkpoint cannot figure out the jobid on first contact with
> the HNP/mpirun process, so this is displayed as an INVALID handle.
>
> -- Josh
>
> On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:
>
> >
> > Hi Everyone,
> > I noticed that it hangs just before displaying the
> > following while trying to checkpoint the application.
> >
> > ############################
> > [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> > ###############################
> >
> > Can it be related to the above?
> >
> > Thanks
> >
> >
> >
> ----------------------------------------------------------------------------------------------------------------------
> > Hi Everyone,
> > I wrote a small program with a function to
> > trigger the checkpointing mechanism as follows:
> >
> > ############################################
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <signal.h>
> >
> > void trigger_checkpoint(void);
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, size;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 10");
> >
> >     /* Request a checkpoint of this job from inside the application. */
> >     trigger_checkpoint();
> >
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 10");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 10");
> >     printf("bye \n");
> >
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > /* Runs ompi-checkpoint against the local mpirun; system() blocks
> >  * until the command returns. */
> > void trigger_checkpoint(void)
> > {
> >     printf("hi\n");
> >     system("ompi-checkpoint -v `pidof mpirun` ");
> > }
> > #############################################
> >
> >
> > The application works fine on my laptop with Ubuntu as the OS.
> > However, when I tried running it on one of the machines at my uni,
> > with SUSE Linux installed, the application hangs as soon as
> > ompi-checkpoint is triggered. This is what I get:
> >
> >
> >
> > ##########################################################
> > I am processor no 0 of a total of 1 procs
> > hi
> > I am processor no 0 of a total of 1 procs
> > [sun06:15426] orte_checkpoint: Checkpointing...
> > [sun06:15426] PID 15411
> > [sun06:15426] Connected to Mpirun [[12727,0],0]
> > [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
> > ###################################################
> >
> > Does anyone have any ideas about this?
> >
> > Thanks a lot
> >
> > Jean.
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
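
A note on the test program quoted above: system() does not return until
ompi-checkpoint exits, so the rank stays outside the MPI library for the
whole checkpoint request. If the hang comes from the situation Josh
describes (the checkpoint waiting for the application to enter the MPI
library), one thing that might be worth trying is to start the command
without waiting for it. The following is only a sketch under that
assumption. spawn_nonblocking() and poll_child() are hypothetical helper
names, not part of Open MPI, and nothing in this thread confirms that this
approach resolves the hang.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start the program named by argv[0] and return
 * immediately instead of waiting for it, unlike system().
 * Returns the child's PID, or -1 if fork() fails. */
pid_t spawn_nonblocking(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: replace this process image with the command. */
        execvp(argv[0], argv);
        _exit(127); /* reached only if execvp() failed */
    }
    return pid; /* parent: child PID, or -1 on fork() failure */
}

/* Hypothetical helper: check on the child without blocking.
 * Returns 1 and stores the exit code if the child has finished,
 * 0 if it is still running, and -1 on error. The stored exit code
 * is -1 if the child did not exit normally (e.g. killed by a signal). */
int poll_child(pid_t pid, int *exit_code)
{
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);

    if (r == 0)
        return 0;
    if (r < 0)
        return -1;

    *exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 1;
}
```

In the test program, trigger_checkpoint() could build an argument vector
for ompi-checkpoint (with the mpirun PID supplied by the caller), pass it
to spawn_nonblocking(), and call poll_child() between later steps. Note
that backquote substitution such as `pidof mpirun` is a shell feature and
is not expanded by execvp().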