Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Application hangs when checkpointing application (update)
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-09-14 09:27:00


Is your application running on the same machine as mpirun?

How did you configure Open MPI? Note that this program will not work
without the FT thread enabled, which would be one reason why it would
seem to hang (since the checkpoint request is waiting for the
application to enter the MPI library). The relevant configure flags are:
   --enable-ft-thread --enable-mpi-threads
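
As a rough sketch, a complete configure invocation with these flags might
look like the following. The --with-ft=cr option (which turns on
checkpoint/restart support) and the install prefix are assumptions not
stated in this message; adjust them, and add whatever checkpointer options
(for example, a BLCR location) your site needs.

```shell
# Sketch only: the prefix is a hypothetical path, and --with-ft=cr is an
# assumption about how checkpoint/restart support is enabled in the build.
./configure --prefix="$HOME/openmpi-ft" \
    --with-ft=cr \
    --enable-ft-thread \
    --enable-mpi-threads
make all install

# Afterwards, ompi_info reports how the build was configured; filtering
# for "checkpoint" and "thread" narrows the output to the relevant lines.
"$HOME/openmpi-ft/bin/ompi_info" | grep -i -e checkpoint -e thread
```

Comparing that ompi_info output between a machine where the program works
and one where it hangs may show whether the two builds differ.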

I do not think the message that you saw is related. Often
orte_checkpoint cannot figure out the jobid on first contact with the
HNP/mpirun process, so this is displayed as an INVALID handle.

-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:

>
> Hi Everyone,
> I noticed that the application hangs just before displaying the
> following message when I try to checkpoint it.
>
> ############################
> [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of
> jobid [INVALID]
> ###############################
>
> Can it be related to the above?
>
> Thanks
>
>
> ----------------------------------------------------------------------------------------------------------------------
> Hi Everyone,
> I wrote a small program with a function to
> trigger the checkpointing mechanism as follows:
>
> ############################################
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <signal.h>
> void trigger_checkpoint(void);
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>
>     /* Request a checkpoint of this job from inside the application. */
>     trigger_checkpoint();
>
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("bye \n");
>
>     MPI_Finalize();
>     return 0;
> }
>
> /* Runs ompi-checkpoint against the PID of mpirun. Note that system()
>  * blocks the caller until the command returns. */
> void trigger_checkpoint(void)
> {
>     printf("hi\n");
>     system("ompi-checkpoint -v `pidof mpirun` ");
> }
> #############################################
>
>
> The application works fine on my laptop, which runs Ubuntu.
> However, when I tried running it on one of the machines at my
> university, which runs SUSE Linux, the application hangs as soon as
> ompi-checkpoint is triggered. This is what I get:
>
>
>
> ##########################################################
> I am processor no 0 of a total of 1 procs
> hi
> I am processor no 0 of a total of 1 procs
> [sun06:15426] orte_checkpoint: Checkpointing...
> [sun06:15426] PID 15411
> [sun06:15426] Connected to Mpirun [[12727,0],0]
> [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process
> PID 15411
> ###################################################
>
> Does anyone have any ideas about this?
>
> Thanks a lot
>
> Jean.
>