Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Checkpoint/Restart svn trunk
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-08-11 19:28:22


This is likely due to two things:

- we just made some minor changes to the session directory stuff
- the checkpoint/restart guy (Josh) is off on vacation for about 3 weeks

I'll file a ticket about this so that he's aware of it and can fix it
when he returns.

Thanks for the heads-up!

On Aug 11, 2008, at 7:16 PM, Caciano Machado wrote:

> I found that Open MPI is looking for the file contact.txt in the wrong
> directory. It always searches for the file in the directory
> "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>/", but this file
> exists only in "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>/0".
> When I copy contact.txt to the directory where Open MPI searches,
> "ompi-ps" and "ompi-checkpoint" work.
>
> On Mon, Aug 11, 2008 at 4:06 PM, Caciano Machado <caciano_at_[hidden]>
> wrote:
>> Hi,
>>
>> I'm trying to run the latest checkpoint/restart (rev 19235), but ompi is
>> showing the following error in "ompi-checkpoint".
>>
>> It seems to be something in function "orte_list_local_hnps" of the
>> file orte/util/hnp_contact.c. I'm using BLCR 0.7.2 and it's working
>> correctly with the example applications.
>>
>> ################################################################
>> root_at_debian:~/pp# ompi-clean
>> root_at_debian:~/pp# mpirun -machinefile machinefile -np 2 -am
>> ft-enable-cr -v -d pp 1 2 1000000
>> [debian:27936] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/0
>> [debian:27936] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
>> [debian:27936] top: openmpi-sessions-root_at_debian_0
>> [debian:27936] tmp: /tmp
>> [debian:27936] [[31810,0],0] hostfile: checking hostfile
>> machinefile for nodes
>> [debian:27936] [[31810,0],0] hostfile: filtering nodes through
>> hostfile machinefile
>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 436
>> [debian:27940] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/1
>> [debian:27940] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
>> [debian:27940] top: openmpi-sessions-root_at_debian_0
>> [debian:27940] tmp: /tmp
>> [debian:27936] defining message event: base/
>> plm_base_launch_support.c 400
>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 679
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>> [[31810,0],0] for tag 1
>> [debian:27936] defining message event: orted/orted_comm.c 382
>> [debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch
>> ffca0200
>> [debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch
>> ffca0200
>> [debian:27936] defining message event: base/odls_base_default_fns.c
>> 1060
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>> to [[31810,0],1]
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,0],0]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,0],0] for tag 1
>> [debian:27940] defining message event: orted/orted_comm.c 382
>> [debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch
>> ffca0200
>> [debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch
>> ffca0200
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient
>> list is empty!
>> [debian:27936] defining message event: base/
>> plm_base_launch_support.c 635
>> [debian:27936] Info: Setting up debugger process table for
>> applications
>> MPIR_being_debugged = 0
>> MPIR_debug_state = 1
>> MPIR_partial_attach_ok = 1
>> MPIR_i_am_starter = 0
>> MPIR_proctable_size = 2
>> MPIR_proctable:
>> (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
>> (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
>> [debian:27942] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/1
>> [debian:27941] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/0
>> [debian:27941] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
>> [debian:27941] top: openmpi-sessions-root_at_debian_0
>> [debian:27941] tmp: /tmp
>> [debian:27942] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
>> [debian:27942] top: openmpi-sessions-root_at_debian_0
>> [debian:27942] tmp: /tmp
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,1],0]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,1],0] for tag 1
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,1],1]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,1],1] for tag 1
>> [debian:27936] defining message event: base/routed_base_receive.c 153
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27941] progressed_wait: base/routed_base_register_sync.c 104
>> [debian:27942] progressed_wait: base/routed_base_register_sync.c 104
>> [debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch
>> ffca0200
>> [debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch
>> ffca0200
>> [debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch
>> ffca0200
>> [debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch
>> ffca0200
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,1],0]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,1],0] for tag 1
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27941] progressed_wait: grpcomm_bad_module.c 394
>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>> [[31810,0],1]
>> [debian:27936] defining message event: orted/orted_comm.c 277
>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>> [[31810,0],1] for tag 1
>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>> [[31810,0],0] for tag 1
>> [debian:27936] defining message event: orted/orted_comm.c 382
>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs
>> delivering
>> message to job [31810,1] tag 15
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>> to [[31810,0],1]
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,1],1]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,1],1] for tag 1
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,0],0]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,0],0] for tag 1
>> [debian:27940] defining message event: orted/orted_comm.c 382
>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs
>> delivering
>> message to job [31810,1] tag 15
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient
>> list is empty!
>> [debian:27942] progressed_wait: grpcomm_bad_module.c 394
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,1],1]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,1],1] for tag 1
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27942] progressed_wait: grpcomm_bad_module.c 270
>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>> [[31810,0],1]
>> [debian:27936] defining message event: orted/orted_comm.c 277
>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>> [[31810,0],1] for tag 1
>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>> [[31810,0],0] for tag 1
>> [debian:27936] defining message event: orted/orted_comm.c 382
>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs
>> delivering
>> message to job [31810,1] tag 17
>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>> to [[31810,0],1]
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,1],0]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,1],0] for tag 1
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>> [[31810,0],0]
>> [debian:27940] defining message event: orted/orted_comm.c 277
>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>> [[31810,0],0] for tag 1
>> [debian:27940] defining message event: orted/orted_comm.c 382
>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs
>> delivering
>> message to job [31810,1] tag 17
>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>> commands completed
>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient
>> list is empty!
>> [debian:27941] progressed_wait: grpcomm_bad_module.c 270
>> #
>> # ping-pong com MPI
>> #
>> # msgs from 1 to 2 bytes
>> # results are the mean of 1000000 repetitions for each msg size
>> # Tue Aug 12 06:26:29 2008
>> #
>> # size lat (us) bw (MB/s)
>>
>> ################################################################
>> 27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am
>> ft-enable-cr -v -d pp 1 2 1000000
>> 27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug
>> --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
>> 27938 ? Ss 0:00 sshd: root_at_notty
>> 27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env
>> -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
>> 27941 ? Rl 0:21 pp 1 2 1000000
>> 27942 ? Rl 0:21 pp 1 2 1000000
>> 28021 pts/0 R+ 0:00 ps xa
>>
>> root_at_debian:~/pp# ompi-checkpoint 27936 -v
>> [debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file
>> orte-checkpoint.c at line 395
>> [debian:28022] HNP with PID 27936 Not found!
>>
>> ################################################################
>>
>> Regards,
>> Caciano Machado
>> Computer Science Graduate Student/UFRGS
>>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems
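
Note on the workaround quoted above: the manual copy of contact.txt from the "0" subdirectory into its parent session directory could be scripted roughly as follows. This is only a sketch of the workaround as described in the message, not a fix to Open MPI itself. The function name `copy_contact_file` and its parameter are hypothetical, and the caller is assumed to pass the per-job session directory exactly as it appears on their system (the directory written as ".../<MPIRUN PID>/" in the message).

```python
import shutil
from pathlib import Path


def copy_contact_file(job_session_dir):
    """Copy contact.txt from <job_session_dir>/0/ up into <job_session_dir>/.

    `job_session_dir` is the per-job session directory the tools search
    (hypothetical parameter; pass the real path from your own system).
    Returns the path of the copied file.
    """
    job_session_dir = Path(job_session_dir)
    source = job_session_dir / "0" / "contact.txt"
    if not source.is_file():
        # Nothing to copy: the layout differs from the one in the report.
        raise FileNotFoundError(f"no contact file at {source}")
    target = job_session_dir / "contact.txt"
    shutil.copy2(source, target)  # copies contents and file metadata
    return target
```

The copy would need to happen after mpirun has created its session directory and before "ompi-ps" or "ompi-checkpoint" is run; it does not address why the tools look in the parent directory.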