Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-checkpoint is hanging
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-10-31 12:20:54


After some additional testing I believe that I have been able to
reproduce the problem. I suspect that there is a bug in the
coordination protocol that is causing an occasional hang in the
system. Since it only happens occasionally (though slightly more often
on a fully loaded machine), that is probably why I missed it in my
testing.

I'll work on a patch, and let you know when it is ready. Unfortunately
it probably won't be for a couple weeks. :(

You can increase the verbose level for all of the fault tolerance
frameworks and components through MCA parameters. They are referenced
in the FT C/R User Doc on the Open MPI wiki, and you can list them
with 'ompi_info'. Look for the following frameworks/components and
verbosity parameters (an example command is sketched after the list):
  - crs/blcr
  - snapc/full
  - crcp/bkmrk
  - opal_cr_verbose
  - orte_cr_verbose
  - ompi_cr_verbose
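
For example, you can list the relevant parameters with ompi_info and
then raise them on the mpirun command line. This is only a rough
sketch: the component-level names (crs_blcr_verbose, snapc_full_verbose,
crcp_bkmrk_verbose) follow the usual <framework>_<component>_verbose
naming convention, so double-check them against your ompi_info output,
and substitute your own binary for './your-app':

   # List the C/R related parameters for each framework/component
   ompi_info --param crs blcr
   ompi_info --param snapc full
   ompi_info --param crcp bkmrk

   # Launch with the verbosity turned up
   mpirun -np 6 -am ft-enable-cr \
       -mca crs_blcr_verbose 10 \
       -mca snapc_full_verbose 10 \
       -mca crcp_bkmrk_verbose 10 \
       -mca opal_cr_verbose 10 \
       ./your-app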

Thanks for the bug report. I filed a ticket in our bug tracker, and
CC'ed you on it. The ticket is:
   http://svn.open-mpi.org/trac/ompi/ticket/1619

Cheers,
Josh

On Oct 31, 2008, at 10:51 AM, Matthias Hovestadt wrote:

> Hi Tim!
>
> First of all: thanks a lot for answering! :-)
>
>
>> Could you try running your two MPI jobs with fewer procs each,
>> say 2 or 3 each instead of 4, so that there are a few extra cores
>> available.
>
> This problem occurs with any number of procs.
>
>> Also, what happens to the checkpointing of one MPI job if you kill the
>> other MPI job after the first "hangs"?
>
> Nothing, it keeps hanging.
>
>> (It may not be a true hang, but very very slow progress that you
>> are observing.)
>
> I already waited for more than 12 hours, but the ompi-checkpoint
> did not return. So if it's slow, it must be very slow.
>
>
> I continued testing and just observed a case where the problem
> occurred with only one job running on the compute node:
>
> -------------------------------------------------------
> ccs_at_grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
> ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
> ccs_at_grid-demo-1:~$
> -------------------------------------------------------
>
> The resource management system tried to checkpoint this job using the
> command "ompi-checkpoint -v --term 7706". This is the output of that
> command:
>
> -------------------------------------------------------
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
> [grid-demo-1.cit.tu-berlin.de:08178] PID 7706
> [grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
> [grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:08178]   Requested - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:08178]   Pending (Termination) - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:08178]   Running - Global Snapshot Reference: (null)
> -------------------------------------------------------
>
> If I look to the activity on the node, I see that the processes
> are still computing:
>
> -------------------------------------------------------
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> 7710 ccs       25   0  327m 6936 4052 R  102  0.7 4:14.17 mpi-x-povray
> 7712 ccs       25   0  327m 6884 4000 R  102  0.7 3:34.06 mpi-x-povray
> 7708 ccs       25   0  327m 6896 4012 R   66  0.7 2:42.10 mpi-x-povray
> 7707 ccs       25   0  331m  10m 3736 R   54  1.0 3:08.62 mpi-x-povray
> 7709 ccs       25   0  327m 6940 4056 R   48  0.7 1:48.24 mpi-x-povray
> 7711 ccs       25   0  327m 6724 4032 R   36  0.7 1:29.34 mpi-x-povray
> -------------------------------------------------------
>
> Now I killed the hanging ompi-checkpoint operation and tried
> to execute a checkpoint manually:
>
> -------------------------------------------------------
> ccs_at_grid-demo-1:~$ ompi-checkpoint -v --term 7706
> [grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
> [grid-demo-1.cit.tu-berlin.de:08224] PID 7706
> [grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
> [grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
> [grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
> [grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> -------------------------------------------------------
>
> Is there perhaps a way of increasing the level of debug output?
> Please let me know if I can support you in any way...
>
>
> Best,
> Matthias
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users