Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-checkpoint is hanging
From: Matthias Hovestadt (maho_at_[hidden])
Date: 2008-10-31 10:51:09


Hi Tim!

First of all: thanks a lot for answering! :-)

> Could you try running your two MPI jobs with fewer procs each,
> say 2 or 3 each instead of 4, so that there are a few extra cores available.

This problem occurs with any number of procs.

> Also, what happens to the checkpointing of one MPI job if you kill the
> other MPI job after the first "hangs"?

Nothing, it keeps hanging.

> (It may not be a true hang, but very very slow progress that you
> are observing.)

I already waited for more than 12 hours, but the ompi-checkpoint
did not return. So if it's slow, it must be very slow.
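
So that a hung ompi-checkpoint does not block the caller indefinitely, the call could be wrapped in a time limit. This is only a minimal sketch: the helper name run_bounded and the 600-second limit are made up for illustration, and it assumes GNU coreutils `timeout` is installed, which exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): run a command with an upper
# bound on its runtime. Assumes GNU coreutils `timeout` is available.
run_bounded() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    # `timeout` reports an expired limit with exit status 124.
    if [ "$status" -eq 124 ]; then
        echo "command timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example with assumed values: give the checkpoint 10 minutes, then give up.
# run_bounded 600 ompi-checkpoint -v --term 7706
```

A status of 124 would then tell the resource management system that the checkpoint did not finish in time, as opposed to failing outright.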

I continued testing and just observed a case where the problem
occurred with only one job running on the compute node:

-------------------------------------------------------
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun
-np 1 -am ft-enable-cr -np 6
/home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov
-w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
-------------------------------------------------------

The resource management system tried to checkpoint this job using the
command "ompi-checkpoint -v --term 7706". This is the output of that
command:

-------------------------------------------------------
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp:
Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp:
Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver:
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver:
Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver:
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver:
Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver:
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver:
Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global
Snapshot Reference: (null)
-------------------------------------------------------

If I look at the activity on the node, I see that the processes
are still computing:

-------------------------------------------------------
   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  7710 ccs 25 0 327m 6936 4052 R 102 0.7 4:14.17 mpi-x-povray
  7712 ccs 25 0 327m 6884 4000 R 102 0.7 3:34.06 mpi-x-povray
  7708 ccs 25 0 327m 6896 4012 R 66 0.7 2:42.10 mpi-x-povray
  7707 ccs 25 0 331m 10m 3736 R 54 1.0 3:08.62 mpi-x-povray
  7709 ccs 25 0 327m 6940 4056 R 48 0.7 1:48.24 mpi-x-povray
  7711 ccs 25 0 327m 6724 4032 R 36 0.7 1:29.34 mpi-x-povray
-------------------------------------------------------

I then killed the hanging ompi-checkpoint operation and tried
to execute a checkpoint manually:

-------------------------------------------------------
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp:
Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp:
Requested a checkpoint of jobid [INVALID]
-------------------------------------------------------

Is there perhaps a way of increasing the level of debug output?
Please let me know if I can support you in any way...

Best,
Matthias