Open MPI Development Mailing List Archives

Subject: [OMPI devel] Checkpoint/Restart svn trunk
From: Caciano Machado (caciano_at_[hidden])
Date: 2008-08-11 15:06:34


Hi,

I'm trying to run the latest checkpoint/restart code from the svn trunk
(rev 19235), but "ompi-checkpoint" fails with the error shown below.

The problem seems to be in the function "orte_list_local_hnps" in the
file orte/util/hnp_contact.c. I'm using BLCR 0.7.2, and it works
correctly with the example applications.

################################################################
root_at_debian:~/pp# ompi-clean
root_at_debian:~/pp# mpirun -machinefile machinefile -np 2 -am ft-enable-cr -v -d pp 1 2 1000000
[debian:27936] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/0
[debian:27936] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
[debian:27936] top: openmpi-sessions-root_at_debian_0
[debian:27936] tmp: /tmp
[debian:27936] [[31810,0],0] hostfile: checking hostfile machinefile for nodes
[debian:27936] [[31810,0],0] hostfile: filtering nodes through
hostfile machinefile
[debian:27936] progressed_wait: base/plm_base_launch_support.c 436
[debian:27940] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/1
[debian:27940] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
[debian:27940] top: openmpi-sessions-root_at_debian_0
[debian:27940] tmp: /tmp
[debian:27936] defining message event: base/plm_base_launch_support.c 400
[debian:27936] defining message event: grpcomm_bad_module.c 183
[debian:27936] progressed_wait: base/plm_base_launch_support.c 679
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
[[31810,0],0] for tag 1
[debian:27936] defining message event: orted/orted_comm.c 382
[debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch ffca0200
[debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch ffca0200
[debian:27936] defining message event: base/odls_base_default_fns.c 1060
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
commands completed
[debian:27936] [[31810,0],0] orte:daemon:send_relay
[debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
to [[31810,0],1]
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,0],0]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,0],0] for tag 1
[debian:27940] defining message event: orted/orted_comm.c 382
[debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch ffca0200
[debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch ffca0200
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27940] [[31810,0],1] orte:daemon:send_relay
[debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
[debian:27936] defining message event: base/plm_base_launch_support.c 635
[debian:27936] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
    (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
[debian:27942] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/1
[debian:27941] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/0
[debian:27941] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
[debian:27941] top: openmpi-sessions-root_at_debian_0
[debian:27941] tmp: /tmp
[debian:27942] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
[debian:27942] top: openmpi-sessions-root_at_debian_0
[debian:27942] tmp: /tmp
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],0]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,1],0] for tag 1
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],1]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,1],1] for tag 1
[debian:27936] defining message event: base/routed_base_receive.c 153
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27941] progressed_wait: base/routed_base_register_sync.c 104
[debian:27942] progressed_wait: base/routed_base_register_sync.c 104
[debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch ffca0200
[debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch ffca0200
[debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch ffca0200
[debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch ffca0200
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],0]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,1],0] for tag 1
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27941] progressed_wait: grpcomm_bad_module.c 394
[debian:27936] [[31810,0],0] orted_recv_cmd: received message from [[31810,0],1]
[debian:27936] defining message event: orted/orted_comm.c 277
[debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
[[31810,0],1] for tag 1
[debian:27936] defining message event: grpcomm_bad_module.c 183
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
commands completed
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
[[31810,0],0] for tag 1
[debian:27936] defining message event: orted/orted_comm.c 382
[debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
message to job [31810,1] tag 15
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
commands completed
[debian:27936] [[31810,0],0] orte:daemon:send_relay
[debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
to [[31810,0],1]
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],1]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,1],1] for tag 1
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,0],0]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,0],0] for tag 1
[debian:27940] defining message event: orted/orted_comm.c 382
[debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
message to job [31810,1] tag 15
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27940] [[31810,0],1] orte:daemon:send_relay
[debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
[debian:27942] progressed_wait: grpcomm_bad_module.c 394
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],1]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,1],1] for tag 1
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27942] progressed_wait: grpcomm_bad_module.c 270
[debian:27936] [[31810,0],0] orted_recv_cmd: received message from [[31810,0],1]
[debian:27936] defining message event: orted/orted_comm.c 277
[debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
[[31810,0],1] for tag 1
[debian:27936] defining message event: grpcomm_bad_module.c 183
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
commands completed
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
[[31810,0],0] for tag 1
[debian:27936] defining message event: orted/orted_comm.c 382
[debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
message to job [31810,1] tag 17
[debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
commands completed
[debian:27936] [[31810,0],0] orte:daemon:send_relay
[debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
to [[31810,0],1]
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],0]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,1],0] for tag 1
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,0],0]
[debian:27940] defining message event: orted/orted_comm.c 277
[debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
[[31810,0],0] for tag 1
[debian:27940] defining message event: orted/orted_comm.c 382
[debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
message to job [31810,1] tag 17
[debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
commands completed
[debian:27940] [[31810,0],1] orte:daemon:send_relay
[debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
[debian:27941] progressed_wait: grpcomm_bad_module.c 270
#
# ping-pong com MPI
#
# msgs from 1 to 2 bytes
# results are the mean of 1000000 repetitions for each msg size
# Tue Aug 12 06:26:29 2008
#
# size lat (us) bw (MB/s)

################################################################
27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am ft-enable-cr -v -d pp 1 2 1000000
27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
27938 ? Ss 0:00 sshd: root_at_notty
27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
27941 ? Rl 0:21 pp 1 2 1000000
27942 ? Rl 0:21 pp 1 2 1000000
28021 pts/0 R+ 0:00 ps xa

root_at_debian:~/pp# ompi-checkpoint 27936 -v
[debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
[debian:28022] HNP with PID 27936 Not found!

################################################################
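
The PID passed to "ompi-checkpoint" above was read by hand from the "ps xa" listing. As a minimal sketch of that step (not part of Open MPI; the helper name find_mpirun_pids is hypothetical, and it assumes the five-column "PID TTY STAT TIME COMMAND" layout seen in the listing), the mpirun PID can be picked out like this:

```python
import os


def find_mpirun_pids(ps_output):
    """Return the PIDs of mpirun processes found in `ps xa`-style text.

    Assumes each process line looks like: PID TTY STAT TIME COMMAND.
    Wrapped continuation lines do not start with a numeric PID and
    are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # keep COMMAND as one field
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        argv0 = fields[4].split()[0]  # first word of the command line
        if os.path.basename(argv0) == "mpirun":
            pids.append(int(fields[0]))
    return pids


# Sample taken from the listing in this message (long lines shortened).
SAMPLE = """\
27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am
ft-enable-cr -v -d pp 1 2 1000000
27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug
27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env
27941 ? Rl 0:21 pp 1 2 1000000
28021 pts/0 R+ 0:00 ps xa
"""

if __name__ == "__main__":
    print(find_mpirun_pids(SAMPLE))
```

This only confirms that 27936 is the mpirun (HNP) process; it does not explain why "ompi-checkpoint" fails to find it.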

Regards,
Caciano Machado
Computer Science Graduate Student/UFRGS