Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing hangs with OpenMPI-1.3.1
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-04-27 09:48:09


Sorry for the long delay in responding.

It is a bit odd that the hang does not occur when running on only one
host. I suspect that is more due to timing than anything else.

I am not able to reproduce the hang at the moment, but I do get an
occasional datatype copy error which could be symptomatic of a related
problem. I'll dig into this a bit more this week and let you know when
I have a fix and if I can reproduce the hang.

Thanks for the bug report.

Cheers,
Josh

On Apr 10, 2009, at 4:56 AM, neeraj_at_[hidden] wrote:

>
> Dear All,
>
> I am trying to checkpoint a test application using openmpi-1.3.1,
> but it fails when the processes run on multiple nodes.
>
> Checkpointing works fine if the processes run on the same node as
> the mpirun process. But the moment I launch the MPI processes on a
> different node, it hangs.
>
> For example:
> mpirun -np 2 ./test (checkpoints fine using
> ompi-checkpoint -v <mpirun_pid>)
> but
> mpirun -np 2 -H host1 ./test (checkpointing hangs)
>
> Similarly,
> mpirun -np 2 -H localhost,host1 ./test still hangs while
> checkpointing.
>
> Below is the output produced while checkpointing:
>
> ------------------------------
> [n0:01596] orte_checkpoint: Checkpointing...
> [n0:01596] PID 1514
> [n0:01596] Connected to Mpirun [[11946,0],0]
> [n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
> [n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
> [n0:01596] Requested - Global Snapshot Reference: (null)
> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
> [n0:01596] Pending - Global Snapshot Reference: (null)
> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
> [n0:01596] Running - Global Snapshot Reference: (null)
>
> Note: It hangs here
>
> ------------------------------
>
> The command used to launch the program is:
>
> /usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out
>
> The dummy program is quite simple:
>
> #include <time.h>
> #include <stdio.h>
> #include <mpi.h>
>
> #define LIMIT 10000000
>
> int main(int argc, char **argv)
> {
>     int i;
>     int my_rank; /* Rank of this process */
>     int np;      /* Number of processes */
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>
>     /* Print and synchronize repeatedly so the job runs long enough
>        to be checkpointed. */
>     for (i = 0; i <= LIMIT; i++)
>     {
>         printf("\n HELLO %d", i);
>         //sleep(10); /* would need <unistd.h> if enabled */
>         MPI_Barrier(MPI_COMM_WORLD);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
>
>
> Let me know what the error could be. I suspect the problem is in
> MPI process coordination.
>
> Regards
>
>
> Neeraj Chourasia
> Member of Technical Staff
> Computational Research Laboratories Limited
> (A wholly Owned Subsidiary of TATA SONS Ltd)
> P: +91.9890003757
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users