Open MPI User's Mailing List Archives

Subject: [OMPI users] Checkpointing hangs with OpenMPI-1.3.1
From: neeraj_at_[hidden]
Date: 2009-04-10 04:56:58


Dear All,

   I am trying to checkpoint a test application using openmpi-1.3.1, but
it fails when the processes run on different nodes.

 Checkpointing works fine if the processes run on the same node as the
mpirun process. But the moment I launch the MPI processes on a different
node, it hangs.

 e.g.
   mpirun -np 2 ./test
     (checkpoints fine using ompi-checkpoint -v <mpirun_pid>)
 but
   mpirun -np 2 -H host1 ./test
     (checkpointing hangs)

Similarly,
   mpirun -np 2 -H localhost,host1 ./test
still hangs while checkpointing.
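
The three launch cases above can be collected into one small script, which makes it easier to rerun them side by side. This is only a sketch under stated assumptions: the variable names REMOTE_HOST, TEST_BIN, DRY_RUN, WAIT_SECS and CKPT_TIMEOUT and the helper launch_cmd are hypothetical, the -am ft-enable-cr option is taken from the full launch command quoted later in this message, and the 60-second limit around ompi-checkpoint is an arbitrary choice so that a hang does not block the script. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the three launch cases from this report.
# Hypothetical knobs (not part of Open MPI): REMOTE_HOST, TEST_BIN,
# DRY_RUN, WAIT_SECS, CKPT_TIMEOUT.
REMOTE_HOST="${REMOTE_HOST:-host1}"
TEST_BIN="${TEST_BIN:-./test}"
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands
WAIT_SECS="${WAIT_SECS:-5}"        # time to let the job start up
CKPT_TIMEOUT="${CKPT_TIMEOUT:-60}" # give up on a hung checkpoint

# Print the mpirun command line for one case: local | remote | mixed.
launch_cmd() {
    case "$1" in
        local)  echo "mpirun -np 2 -am ft-enable-cr $TEST_BIN" ;;
        remote) echo "mpirun -np 2 -H $REMOTE_HOST -am ft-enable-cr $TEST_BIN" ;;
        mixed)  echo "mpirun -np 2 -H localhost,$REMOTE_HOST -am ft-enable-cr $TEST_BIN" ;;
        *)      echo "unknown case: $1" >&2; return 1 ;;
    esac
}

for c in local remote mixed; do
    cmd=$(launch_cmd "$c")
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$c] $cmd"
        continue
    fi
    # Start the job in the background and remember the mpirun PID.
    $cmd > "run-$c.log" 2>&1 &
    pid=$!
    sleep "$WAIT_SECS"
    # Request a checkpoint; timeout(1) ends the request if it hangs.
    if timeout "$CKPT_TIMEOUT" ompi-checkpoint -v "$pid"; then
        echo "[$c] checkpoint completed"
    else
        echo "[$c] checkpoint failed or timed out"
    fi
    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
done
```

Running it with DRY_RUN=0 REMOTE_HOST=n5 TEST_BIN=./a.out would try all three cases against the node used in this report; per the behaviour described above, the "remote" and "mixed" cases would be expected to hit the timeout.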

Please find below the output produced while checkpointing:

--------------xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx----------------------------
[n0:01596] orte_checkpoint: Checkpointing...
[n0:01596] PID 1514
[n0:01596] Connected to Mpirun [[11946,0],0]
[n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514

[n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Requested - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Pending - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Running - Global Snapshot Reference: (null)

Note: It hangs here

------------------------------*******************************---------------------

The command used to launch the program is:

/usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out

The dummy program is quite simple:

#include <time.h>
#include <stdio.h>
#include <mpi.h>

#define LIMIT 10000000

int main(int argc, char **argv)
{
    int i;
    int my_rank;   /* Rank of process */
    int np;        /* Number of processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Print and synchronise in a long loop so that the job stays
       alive while a checkpoint is requested. */
    for (i = 0; i <= LIMIT; i++)
    {
        printf("\n HELLO %d", i);
        /* sleep(10); */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
 
 

Please let me know what the error could be. I suspect the problem lies in
the coordination between the MPI processes.

Regards

Neeraj Chourasia
Member of Technical Staff
Computational Research Laboratories Limited
(A wholly Owned Subsidiary of TATA SONS Ltd)
P: +91.9890003757
