Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Checkpoint/Restart svn trunk
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-08-12 07:21:31


Ralph committed a proper fix yesterday; see if that works for you.

On Aug 11, 2008, at 7:44 PM, Caciano Machado wrote:

> Jeff,
>
> Here is an ugly hack that I'm using to get this working on Linux until
> Josh returns.
>
> ##########################################################
> --- ompi-trunk/orte/util/hnp_contact.c 2008-08-12 12:10:07.000000000 +0200
> +++ ompi-trunk-caciano/orte/util/hnp_contact.c 2008-08-12 12:08:52.000000000 +0200
> @@ -255,7 +255,7 @@
> * See if a contact file exists in this directory and read it
> */
> contact_filename = opal_os_path( false, headdir,
> - dir_entry->d_name, "contact.txt", NULL );
> + dir_entry->d_name, "0/contact.txt", NULL );
>
> hnp = OBJ_NEW(orte_hnp_contact_t);
> if (ORTE_SUCCESS == (ret = orte_read_hnp_contact_file(contact_filename, hnp))) {
> ##########################################################
>
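
The patch above hard-codes the "0/" subdirectory. As a rough illustration only, a lookup could instead try both locations in order. The sketch below is not Open MPI code and is not the fix that was committed (the thread does not show it); `find_contact_file` and its candidate list are hypothetical names, and the existence check assumes POSIX `access()`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): look for contact.txt under
 * dir, first directly and then in the "0" subdirectory. On success the
 * first readable path is left in buf and 0 is returned. Returns -1 if no
 * candidate is readable or a candidate path does not fit in buf. */
int find_contact_file(char *buf, size_t len, const char *dir)
{
    /* Assumed candidate locations, taken from the two paths in the thread. */
    static const char *const candidates[] = { "contact.txt", "0/contact.txt" };
    size_t i;

    assert(buf != NULL && dir != NULL);

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        /* dir + '/' + candidate + terminating NUL must fit in buf. */
        if (strlen(dir) + strlen(candidates[i]) + 2 > len) {
            return -1;
        }
        snprintf(buf, len, "%s/%s", dir, candidates[i]);

        /* access() with R_OK succeeds only if the file exists and is readable. */
        if (access(buf, R_OK) == 0) {
            return 0;
        }
    }
    return -1;
}
```

A caller would pass the per-run session directory as `dir` and hand the resulting path to whatever reads the contact file. Trying both locations keeps such a lookup working whichever layout is in use, at the cost of one extra `access()` call.
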
> Regards
>
> On Mon, Aug 11, 2008 at 8:28 PM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
>> This is likely due to two things:
>>
>> - we just made some minor changes to the session directory stuff
>> - the checkpoint/restart guy (Josh) is off on vacation for about 3
>> weeks
>>
>> I'll file a ticket about this so that he's aware of it and can fix it
>> when he returns.
>>
>> Thanks for the heads-up!
>>
>>
>> On Aug 11, 2008, at 7:16 PM, Caciano Machado wrote:
>>
>>> I found that Open MPI is looking for the file contact.txt in the wrong
>>> directory. It always searches for the file in the directory
>>> "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>/", but this file
>>> exists only in "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>/0".
>>> When I copy contact.txt to the directory where Open MPI searches,
>>> "ompi-ps" and "ompi-checkpoint" work.
>>>
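
The mismatch described above can be made concrete with a small C sketch. The helper `contact_path` is hypothetical and not part of Open MPI. It only formats the two locations named in the message, assuming the layout `<session dir>/<per-run dir>/[<subdir>/]contact.txt`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): format the path to
 * contact.txt below a session directory. If subdir is non-NULL it is
 * placed between the per-run directory and the file name.
 * Returns 0 on success, -1 if run_dir is empty or the path does not
 * fit in buf. */
int contact_path(char *buf, size_t len, const char *session_dir,
                 const char *run_dir, const char *subdir)
{
    int n;

    assert(buf != NULL && session_dir != NULL && run_dir != NULL);

    if (strlen(run_dir) == 0) {
        return -1;
    }

    if (subdir != NULL) {
        n = snprintf(buf, len, "%s/%s/%s/contact.txt",
                     session_dir, run_dir, subdir);
    } else {
        n = snprintf(buf, len, "%s/%s/contact.txt", session_dir, run_dir);
    }

    /* snprintf reports the length it wanted to write; reject truncation. */
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Taking `31810` from the `jobdir` lines in the log of the original report as the per-run directory (an assumption), `contact_path(buf, sizeof buf, "/tmp/openmpi-sessions-root_at_debian_0", "31810", NULL)` produces the path that was searched, and passing `"0"` as the last argument produces the path where the file was found.
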
>>> On Mon, Aug 11, 2008 at 4:06 PM, Caciano Machado <caciano_at_[hidden]>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to run the latest checkpoint/restart code (rev 19235), but
>>>> "ompi-checkpoint" is showing the error below.
>>>>
>>>> The problem seems to be in the function "orte_list_local_hnps" in the
>>>> file orte/util/hnp_contact.c. I'm using BLCR 0.7.2 and it's working
>>>> correctly with the example applications.
>>>>
>>>> ################################################################
>>>> root_at_debian:~/pp# ompi-clean
>>>> root_at_debian:~/pp# mpirun -machinefile machinefile -np 2 -am ft-enable-cr -v -d pp 1 2 1000000
>>>> [debian:27936] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/0
>>>> [debian:27936] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
>>>> [debian:27936] top: openmpi-sessions-root_at_debian_0
>>>> [debian:27936] tmp: /tmp
>>>> [debian:27936] [[31810,0],0] hostfile: checking hostfile
>>>> machinefile for
>>>> nodes
>>>> [debian:27936] [[31810,0],0] hostfile: filtering nodes through
>>>> hostfile machinefile
>>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 436
>>>> [debian:27940] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/1
>>>> [debian:27940] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
>>>> [debian:27940] top: openmpi-sessions-root_at_debian_0
>>>> [debian:27940] tmp: /tmp
>>>> [debian:27936] defining message event: base/plm_base_launch_support.c 400
>>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 679
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>>> [[31810,0],0] for tag 1
>>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>>> [debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch
>>>> ffca0200
>>>> [debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch
>>>> ffca0200
>>>> [debian:27936] defining message event: base/odls_base_default_fns.c 1060
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay
>>>> msg
>>>> to [[31810,0],1]
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,0],0]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,0],0] for tag 1
>>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>>> [debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch
>>>> ffca0200
>>>> [debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch
>>>> ffca0200
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient
>>>> list is
>>>> empty!
>>>> [debian:27936] defining message event: base/plm_base_launch_support.c 635
>>>> [debian:27936] Info: Setting up debugger process table for
>>>> applications
>>>> MPIR_being_debugged = 0
>>>> MPIR_debug_state = 1
>>>> MPIR_partial_attach_ok = 1
>>>> MPIR_i_am_starter = 0
>>>> MPIR_proctable_size = 2
>>>> MPIR_proctable:
>>>> (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
>>>> (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
>>>> [debian:27942] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/1
>>>> [debian:27941] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/0
>>>> [debian:27941] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
>>>> [debian:27941] top: openmpi-sessions-root_at_debian_0
>>>> [debian:27941] tmp: /tmp
>>>> [debian:27942] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
>>>> [debian:27942] top: openmpi-sessions-root_at_debian_0
>>>> [debian:27942] tmp: /tmp
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,1],0]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,1],0] for tag 1
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,1],1]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,1],1] for tag 1
>>>> [debian:27936] defining message event: base/routed_base_receive.c
>>>> 153
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27941] progressed_wait: base/routed_base_register_sync.c
>>>> 104
>>>> [debian:27942] progressed_wait: base/routed_base_register_sync.c
>>>> 104
>>>> [debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch
>>>> ffca0200
>>>> [debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch
>>>> ffca0200
>>>> [debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch
>>>> ffca0200
>>>> [debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch
>>>> ffca0200
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,1],0]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,1],0] for tag 1
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 394
>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>>>> [[31810,0],1]
>>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>>> [[31810,0],1] for tag 1
>>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>>> [[31810,0],0] for tag 1
>>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs
>>>> delivering
>>>> message to job [31810,1] tag 15
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay
>>>> msg
>>>> to [[31810,0],1]
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,1],1]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,1],1] for tag 1
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,0],0]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,0],0] for tag 1
>>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs
>>>> delivering
>>>> message to job [31810,1] tag 15
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient
>>>> list is
>>>> empty!
>>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 394
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,1],1]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,1],1] for tag 1
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 270
>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>>>> [[31810,0],1]
>>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>>> [[31810,0],1] for tag 1
>>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>>> [[31810,0],0] for tag 1
>>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs
>>>> delivering
>>>> message to job [31810,1] tag 17
>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay
>>>> msg
>>>> to [[31810,0],1]
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,1],0]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,1],0] for tag 1
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>>> [[31810,0],0]
>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>>> [[31810,0],0] for tag 1
>>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs
>>>> delivering
>>>> message to job [31810,1] tag 17
>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>>> commands completed
>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient
>>>> list is
>>>> empty!
>>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 270
>>>> #
>>>> # ping-pong com MPI
>>>> #
>>>> # msgs from 1 to 2 bytes
>>>> # results are the mean of 1000000 repetitions for each msg size
>>>> # Tue Aug 12 06:26:29 2008
>>>> #
>>>> # size lat (us) bw (MB/s)
>>>>
>>>> ################################################################
>>>> 27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am ft-enable-cr -v -d pp 1 2 1000000
>>>> 27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
>>>> 27938 ? Ss 0:00 sshd: root_at_notty
>>>> 27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
>>>> 27941 ? Rl 0:21 pp 1 2 1000000
>>>> 27942 ? Rl 0:21 pp 1 2 1000000
>>>> 28021 pts/0 R+ 0:00 ps xa
>>>>
>>>> root_at_debian:~/pp# ompi-checkpoint 27936 -v
>>>> [debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
>>>> [debian:28022] HNP with PID 27936 Not found!
>>>>
>>>> ################################################################
>>>>
>>>> Regards,
>>>> Caciano Machado
>>>> Computer Science Graduate Student/UFRGS
>>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems

-- 
Jeff Squyres
Cisco Systems