Open MPI User's Mailing List Archives

Subject: [OMPI users] Application hangs when checkpointing application (update)
From: Jean Potsam (jeanpotsam_at_[hidden])
Date: 2009-09-11 09:50:06


 
Hi Everyone,

I noticed that the application hangs just before the following line would be displayed when I try to checkpoint it:
 
############################
[sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID] 
###############################
 
Could the hang be related to the message above?
 
Thanks
 
 
----------------------------------------------------------------------------------------------------------------------
Hi Everyone,

I wrote a small program with a function that triggers the checkpointing mechanism, as follows:
 
############################################
 
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

void trigger_checkpoint(void);

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");

    /* Request a checkpoint of this job from inside the application. */
    trigger_checkpoint();

    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("bye \n");

    MPI_Finalize();
    return 0;
}

/* Runs ompi-checkpoint against the mpirun process found by pidof. */
void trigger_checkpoint(void)
{
    printf("hi\n");
    system("ompi-checkpoint -v `pidof mpirun` ");
}
#############################################
      
 
The application works fine on my laptop, which runs Ubuntu. However, when I ran it on one of the machines at my university, which has SUSE Linux installed, the application hangs as soon as ompi-checkpoint is triggered. This is what I get:
 
 
 
##########################################################
I am processor no 0 of a total of 1 procs
hi
I am processor no 0 of a total of 1 procs
[sun06:15426] orte_checkpoint: Checkpointing...
[sun06:15426]    PID 15411
[sun06:15426]    Connected to Mpirun [[12727,0],0]
[sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
###################################################
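
One way to narrow this down, offered only as a sketch: check the status that system() returns, so that it is clear whether ompi-checkpoint exited with an error or never returned at all. The helper name run_and_report is hypothetical and is not part of Open MPI. It uses only the standard system() call and the wait-status macros.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper (not part of Open MPI): run a shell command with
 * system() and report how it ended. Returns the command's exit code,
 * or -1 if it could not be started or did not exit normally. */
int run_and_report(const char *cmd)
{
    int status = system(cmd);   /* blocks until the command finishes */

    if (status == -1) {
        perror("system");
        return -1;
    }
    if (WIFEXITED(status)) {
        int code = WEXITSTATUS(status);
        fprintf(stderr, "'%s' exited with code %d\n", cmd, code);
        return code;
    }
    fprintf(stderr, "'%s' did not exit normally\n", cmd);
    return -1;
}
```

In trigger_checkpoint(), the system() call could be replaced by run_and_report("ompi-checkpoint -v `pidof mpirun` "). If no message is printed after the hang, the command itself never returned.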

 
Does anyone have any ideas about this?
 
Thanks a lot
 
Jean.