Dear All,
I am trying to checkpoint a test application using openmpi-1.3.1,
but it fails when the processes run on a different node than mpirun.
Checkpointing works fine if the processes run on the same node as the
mpirun process. But the moment I launch the MPI processes on a different
node, the checkpoint hangs.
For example:
mpirun -np 2 ./test (checkpoints fine using ompi-checkpoint -v <mpirun_pid>)
but
mpirun -np 2 -H host1 ./test (checkpointing hangs)
Similarly,
mpirun -np 2 -H localhost,host1 ./test also hangs while checkpointing.
Please find below the output produced while checkpointing:
--------------xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx----------------------------
[n0:01596] orte_checkpoint: Checkpointing...
[n0:01596] PID 1514
[n0:01596] Connected to Mpirun [[11946,0],0]
[n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
[n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Requested - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Pending - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Running - Global Snapshot Reference: (null)
Note: It hangs here
------------------------------*******************************---------------------
The command used to launch the program is:
/usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out
The dummy program is pretty simple:
#include <stdio.h>
#include <unistd.h>   /* sleep(), used only by the commented-out call below */
#include <mpi.h>

#define LIMIT 10000000

int main(int argc, char **argv)
{
    int i;
    int my_rank;   /* Rank of process */
    int np;        /* Number of processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Print and synchronize repeatedly so the job runs long enough to checkpoint */
    for (i = 0; i <= LIMIT; i++)
    {
        printf("\n HELLO %d", i);
        //sleep(10);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
Please let me know what the error could be. I suspect the problem lies in
the coordination between the MPI processes.
Regards
Neeraj Chourasia
Member of Technical Staff
Computational Research Laboratories Limited
(A wholly Owned Subsidiary of TATA SONS Ltd)
P: +91.9890003757