Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing hangs with OpenMPI-1.3.1
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-04-27 15:00:19


I still have not been able to reproduce the hang, but I'm still
looking into it.

I did commit a fix for the datatype copy error that I mentioned
(r21080 in the Open MPI trunk, and it is in the pipeline for v1.3).

Can you put a print statement just before MPI_Finalize and then run the
program again? I am wondering whether the problem is in MPI_Finalize
rather than MPI_Barrier: one (or both) of the processes may be entering
MPI_Finalize while a checkpoint is in progress. Unfortunately, I have
not tested the MPI_Finalize scenario in a long time, but I will put
that on my to-do list.

Cheers,
Josh

On Apr 27, 2009, at 9:48 AM, Josh Hursey wrote:

> Sorry for the long delay to respond.
>
> It is a bit odd that the hang does not occur when running on only
> one host. I suspect that is more due to timing than anything else.
>
> I am not able to reproduce the hang at the moment, but I do get an
> occasional datatype copy error which could be symptomatic of a
> related problem. I'll dig into this a bit more this week and let you
> know when I have a fix and if I can reproduce the hang.
>
> Thanks for the bug report.
>
> Cheers,
> Josh
>
> On Apr 10, 2009, at 4:56 AM, neeraj_at_[hidden] wrote:
>
>>
>> Dear All,
>>
>> I am trying to checkpoint a test application using openmpi-1.3.1,
>> but it fails when I run multiple processes on different nodes.
>>
>> Checkpointing works fine if the processes run on the same node as
>> the mpirun process. But the moment I launch an MPI process on a
>> different node, it hangs.
>>
>> For example,
>> mpirun -np 2 ./test
>> checkpoints fine with ompi-checkpoint -v <mpirun_pid>, but
>> mpirun -np 2 -H host1 ./test
>> hangs during checkpointing.
>>
>> Similarly,
>> mpirun -np 2 -H localhost,host1 ./test
>> still hangs while checkpointing.
>>
>> Please find the output which comes while checkpointing
>>
>> --------------
>> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx----------------------------
>> [n0:01596] orte_checkpoint: Checkpointing...
>> [n0:01596] PID 1514
>> [n0:01596] Connected to Mpirun [[11946,0],0]
>> [n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
>> [n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
>> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
>> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
>> [n0:01596] Requested - Global Snapshot Reference: (null)
>> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
>> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
>> [n0:01596] Pending - Global Snapshot Reference: (null)
>> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
>> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
>> [n0:01596] Running - Global Snapshot Reference: (null)
>>
>> Note: It hangs here
>>
>> ------------------------------
>> *******************************---------------------
>>
>> Command used to launch program is
>>
>> /usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out
>>
>> And the dummy program is pretty simple as follows
>>
>> #include <time.h>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> #define LIMIT 10000000
>>
>> int main(int argc, char **argv)
>> {
>>     int i;
>>     int my_rank; /* Rank of this process */
>>     int np;      /* Number of processes */
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>
>>     for (i = 0; i <= LIMIT; i++)
>>     {
>>         printf("\n HELLO %d", i);
>>         //sleep(10);
>>         MPI_Barrier(MPI_COMM_WORLD);
>>     }
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>>
>>
>> Please let me know what the error could be. I suspect the problem
>> is in the coordination between the MPI processes.
>>
>> Regards
>>
>>
>> Neeraj Chourasia
>> Member of Technical Staff
>> Computational Research Laboratories Limited
>> (A wholly Owned Subsidiary of TATA SONS Ltd)
>> P: +91.9890003757
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>