Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Checkpoint/Restart svn trunk
From: Caciano Machado (caciano_at_[hidden])
Date: 2008-08-11 19:44:25


Jeff,

Here is an ugly hack that I'm using to get this working on Linux until
Josh returns.

##########################################################
--- ompi-trunk/orte/util/hnp_contact.c 2008-08-12 12:10:07.000000000 +0200
+++ ompi-trunk-caciano/orte/util/hnp_contact.c 2008-08-12 12:08:52.000000000 +0200
@@ -255,7 +255,7 @@
          * See if a contact file exists in this directory and read it
          */
         contact_filename = opal_os_path( false, headdir,
- dir_entry->d_name, "contact.txt", NULL );
+ dir_entry->d_name, "0/contact.txt", NULL );

         hnp = OBJ_NEW(orte_hnp_contact_t);
         if (ORTE_SUCCESS == (ret = orte_read_hnp_contact_file(contact_filename, hnp))) {
##########################################################
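
A less brittle variant of the same idea would be to try both locations instead of hard-coding "0/". The following is only a standalone sketch in plain C using POSIX access(), not a patch against the trunk: find_contact_file is a hypothetical helper that does not exist in ORTE, and the real code builds its paths with opal_os_path rather than snprintf.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Open MPI): look for contact.txt
 * under a per-mpirun session directory, trying the newer "0/"
 * subdirectory first and the old top-level location second.
 *
 * On success, writes the path that was found into `out` and returns 0.
 * If neither candidate is readable, writes an empty string and returns -1.
 */
int find_contact_file(const char *job_dir, char *out, size_t out_len)
{
    static const char *const candidates[] = {
        "0/contact.txt",  /* layout after the session directory change */
        "contact.txt"     /* layout that the lookup code still expects */
    };
    size_t i;

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        int n = snprintf(out, out_len, "%s/%s", job_dir, candidates[i]);
        if (n < 0 || (size_t)n >= out_len) {
            continue;  /* path did not fit into the buffer, skip it */
        }
        if (access(out, R_OK) == 0) {
            return 0;  /* first readable candidate wins */
        }
    }

    if (out_len > 0) {
        out[0] = '\0';
    }
    return -1;
}
```

A caller would pass the directory that is scanned per mpirun (for example "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>") and hand the resulting path to whatever reads the contact file. Trying both layouts keeps the lookup working regardless of which side of the session directory change a given build is on.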

Regards

On Mon, Aug 11, 2008 at 8:28 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> This is likely due to two things:
>
> - we just made some minor changes to the session directory stuff
> - the checkpoint/restart guy (Josh) is off on vacation for about 3 weeks
>
> I'll file a ticket about this so that he's aware of it and can fix it when
> he returns.
>
> Thanks for the heads-up!
>
>
> On Aug 11, 2008, at 7:16 PM, Caciano Machado wrote:
>
>> I found that Open MPI is looking for the file contact.txt in the wrong
>> directory. It always searches for the file in the directory
>> "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>/", but this file
>> exists only in "/tmp/openmpi-sessions-root_at_debian_0/<MPIRUN PID>/0".
>> When I copy contact.txt to the directory where Open MPI searches,
>> "ompi-ps" and "ompi-checkpoint" work.
>>
>> On Mon, Aug 11, 2008 at 4:06 PM, Caciano Machado <caciano_at_[hidden]>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to run the latest checkpoint/restart (rev 19235), but ompi is
>>> showing the following error in "ompi-checkpoint".
>>>
>>> It seems to be something in function "orte_list_local_hnps" of the
>>> file orte/util/hnp_contact.c. I'm using BLCR 0.7.2 and it's working
>>> correctly with the example applications.
>>>
>>> ################################################################
>>> root_at_debian:~/pp# ompi-clean
>>> root_at_debian:~/pp# mpirun -machinefile machinefile -np 2 -am
>>> ft-enable-cr -v -d pp 1 2 1000000
>>> [debian:27936] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/0
>>> [debian:27936] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
>>> [debian:27936] top: openmpi-sessions-root_at_debian_0
>>> [debian:27936] tmp: /tmp
>>> [debian:27936] [[31810,0],0] hostfile: checking hostfile machinefile for
>>> nodes
>>> [debian:27936] [[31810,0],0] hostfile: filtering nodes through
>>> hostfile machinefile
>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 436
>>> [debian:27940] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0/1
>>> [debian:27940] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/0
>>> [debian:27940] top: openmpi-sessions-root_at_debian_0
>>> [debian:27940] tmp: /tmp
>>> [debian:27936] defining message event: base/plm_base_launch_support.c 400
>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 679
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>> [debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27936] defining message event: base/odls_base_default_fns.c 1060
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>>> to [[31810,0],1]
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,0],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>> [debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is
>>> empty!
>>> [debian:27936] defining message event: base/plm_base_launch_support.c 635
>>> [debian:27936] Info: Setting up debugger process table for applications
>>> MPIR_being_debugged = 0
>>> MPIR_debug_state = 1
>>> MPIR_partial_attach_ok = 1
>>> MPIR_i_am_starter = 0
>>> MPIR_proctable_size = 2
>>> MPIR_proctable:
>>> (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
>>> (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
>>> [debian:27942] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/1
>>> [debian:27941] procdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1/0
>>> [debian:27941] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
>>> [debian:27941] top: openmpi-sessions-root_at_debian_0
>>> [debian:27941] tmp: /tmp
>>> [debian:27942] jobdir: /tmp/openmpi-sessions-root_at_debian_0/31810/1
>>> [debian:27942] top: openmpi-sessions-root_at_debian_0
>>> [debian:27942] tmp: /tmp
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],0] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],1]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],1] for tag 1
>>> [debian:27936] defining message event: base/routed_base_receive.c 153
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27941] progressed_wait: base/routed_base_register_sync.c 104
>>> [debian:27942] progressed_wait: base/routed_base_register_sync.c 104
>>> [debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],0] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 394
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>>> [[31810,0],1]
>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],1] for tag 1
>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 15
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>>> to [[31810,0],1]
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],1]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],1] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,0],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 15
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is
>>> empty!
>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 394
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],1]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],1] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 270
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>>> [[31810,0],1]
>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],1] for tag 1
>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 17
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>>> to [[31810,0],1]
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],0] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,0],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 17
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is
>>> empty!
>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 270
>>> #
>>> # ping-pong com MPI
>>> #
>>> # msgs from 1 to 2 bytes
>>> # results are the mean of 1000000 repetitions for each msg size
>>> # Tue Aug 12 06:26:29 2008
>>> #
>>> # size lat (us) bw (MB/s)
>>>
>>> ################################################################
>>> 27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am
>>> ft-enable-cr -v -d pp 1 2 1000000
>>> 27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug
>>> --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
>>> 27938 ? Ss 0:00 sshd: root_at_notty
>>> 27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env
>>> -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
>>> 27941 ? Rl 0:21 pp 1 2 1000000
>>> 27942 ? Rl 0:21 pp 1 2 1000000
>>> 28021 pts/0 R+ 0:00 ps xa
>>>
>>> root_at_debian:~/pp# ompi-checkpoint 27936 -v
>>> [debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file
>>> orte-checkpoint.c at line 395
>>> [debian:28022] HNP with PID 27936 Not found!
>>>
>>> ################################################################
>>>
>>> Regards,
>>> Caciano Machado
>>> Computer Science Graduate Student/UFRGS
>>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> Cisco Systems
>