
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Sergio Díaz (sdiaz_at_[hidden])
Date: 2009-12-14 12:25:27


Hi Reuti,

Yes, I sent a job with SGE and I checkpointed the mpirun process by
hand, logging into the MPI master node. Then I killed the job with qdel
and after that I did the ompi-restart.
I will try to integrate it with SGE by creating a checkpoint environment,
but I think it could be a bit difficult because:
         1 - when I do a checkpoint, I can't specify a directory with a
name like checkpoint_jobid;
         2 - I can't specify the scratch directory, and I have to use
/tmp instead of SGE's scratch directory;
         3 - I tried to restart the snapshot and it only works if I use
the same machinefile. That is, if the job ran on c3-13 and c3-14, I
have to restart the job using a machinefile with these two nodes.

                    [sdiaz_at_svgd ~]$ ompi-restart -v -machinefile
mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
                    [svgd.cesga.es:28836] Checking for the existence of
(/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
                    [svgd.cesga.es:28836] Restarting from file
(ompi_global_snapshot_12554.ckpt)
                    [svgd.cesga.es:28836] Exec in self
                     tiempo 110
                     Process 1 :
                     compute-3-14.local
                                        of 2
                     tiempo 110
                     Process 0 :
                     compute-3-13.local
                                       of 2
                        
--------------------------------------------------------------------------
                        mpirun noticed that process rank 1 with PID 8477
on node compute-3-15 exited on signal 11 (Segmentation fault).
                        
--------------------------------------------------------------------------

To solve problem 1, there is a feature ticket opened by Josh
(https://svn.open-mpi.org/trac/ompi/ticket/2098).
To solve problem 2, there is a thread that discusses it ([OMPI users]
Changing location where checkpoints are saved) and also a bug opened by
Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 . I think that it
could work... we will see.
For problem 3, I haven't had time to look into it yet. But if Josh or
anyone has an idea... please tell us :-)
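In the meantime, problems 1 and 3 can be approximated from the job script itself. The helper below is only an illustration: the per-job name, the use of SGE's JOB_ID and PE_HOSTFILE variables, and whether ompi-restart accepts a renamed snapshot directory are all assumptions, not verified behavior.

```shell
#!/bin/bash
# Illustration only: after a checkpoint finishes, give the snapshot a
# per-job name (problem 1) and keep the job's machinefile next to it
# (problem 3), so a later restart can reuse exactly the same nodes.
archive_snapshot() {
    local snapshot_dir="$1"   # e.g. ompi_global_snapshot_12554.ckpt
    local job_id="$2"         # SGE's $JOB_ID
    local hostfile="$3"       # SGE's $PE_HOSTFILE
    local target="checkpoint_${job_id}.ckpt"

    mv "$snapshot_dir" "$target"
    cp "$hostfile" "$target/machinefile"
    echo "$target"
}

# Restart later with the saved machinefile, e.g.:
#   ompi-restart -machinefile checkpoint_1234.ckpt/machinefile \
#       checkpoint_1234.ckpt
```

Whether the renamed snapshot restarts cleanly depends on what the snapshot metadata records, so treat this as a sketch until tickets 2098/2139 land.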

Reuti, did you test it successfully? How did you solve these problems?

Regards,
Sergio

Reuti wrote:
> Hi,
>
> On 14.12.2009 at 17:05, Sergio Díaz wrote:
>
>> I got a successful checkpoint with a fresh installation and without
>> using the trunk. I can't understand why it is working now when before I
>> couldn't do a successful restart... Maybe there was something wrong in
>> the Open MPI installation and the metadata was created in a wrong
>> way.
>> I will test it more, and I will also test the trunk.
>>
>> Regards,
>> Sergio
>>
>> [sdiaz_at_compute-3-13 ~]$ ompi-restart -machinefile
>> mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>> tiempo 110
>> Process 1 :
>> compute-3-14.local
>> of 2
>> tiempo 110
>> Process 0 :
>> compute-3-13.local
>> of 2
>> tiempo 120
>> Process 1 :
>> compute-3-14.local
>> of 2
>> tiempo 120
>> Process 0 :
>> compute-3-13.local
>> ...
>> ...
>>
>> [sdiaz_at_compute-3-14 ~]$ ps auxf |grep sdiaz
>> sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted
>> --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca
>
> in a tight integration with SGE the daemon should get the argument
> --no-daemonize. Are you restarting, on the command line, a job that
> previously ran under SGE's supervision?
>
> -- Reuti
>
>
>> orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 1739128832.0;tcp://192.168.4.148:45551 -mca
>> mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path
>> /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz
>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
>> sdiaz 26274 0.1 0.0 15984 504 ? Sl 15:58 0:00 \_
>> cr_restart
>> /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
>>
>> sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00
>> \_ ./pi3
>>
>> [sdiaz_at_compute-3-13 ~]$ ps auxf |grep sdiaz
>> root 12878 0.0 0.0 90260 3000 pts/0 S 15:55 0:00
>> | \_ su - sdiaz
>> sdiaz 12880 0.0 0.0 53432 1512 pts/0 S 15:55 0:00
>> | \_ -bash
>> sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00
>> | \_ mpirun -am ft-enable-cr --default-hostfile
>> mpi_test/lanzar_pi3.sh.po3117822 --app
>> /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/restart-appfile
>> sdiaz 13073 0.0 0.0 15988 508 pts/0 Sl+ 15:58 0:00
>> | \_ cr_restart
>> /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.12558
>>
>> sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00
>> | \_ ./pi3
>>
>>
>> Sergio Díaz wrote:
>>>
>>> Hi Josh
>>>
>>> Here is the file.
>>>
>>> I will try to apply the trunk, but I think that I broke my Open MPI
>>> installation doing "something" and I don't know what :-( . I was
>>> modifying the MCA parameters...
>>> When I send a job, the orted daemon spawned on the SLAVE host is
>>> launched in a loop until it consumes all the reserved memory.
>>> It is very strange, so I will compile it again, reproduce the
>>> bug, and then test the trunk.
>>>
>>> Thanks a lot for the support and tickets opened.
>>> Sergio
>>>
>>>
>>> sdiaz 30279 0.0 0.0 1888 560 ? Ds 12:54 0:00
>>> \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
>>> /opt/cesga/sge62/default/spool/compute
>>> sdiaz 30286 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca
>>> ess env -mca orte_ess_jobid 219
>>> sdiaz 30322 0.0 0.0 52772 1188 ? S 12:54
>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30358 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30394 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash
>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30430 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash
>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30466 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash
>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30502 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash
>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30538 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash
>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>> sdiaz 30574 0.0 0.0 52772 1188 ? D 12:54
>>> 0:00 \_ /bin/bash
>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>> ....
>>>
>>>
>>>
>>> Josh Hursey wrote:
>>>>
>>>>
>>>> On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> You were right. The main problem was the /tmp. SGE uses a scratch
>>>>> directory in which the jobs keep temporary files. Setting TMPDIR
>>>>> to /tmp, checkpoint works!
>>>>> However, when I try to restart it... I get the following error
>>>>> (see ERROR1). The -v option adds these lines (see ERROR2).
>>>>
>>>> It is concerning that ompi-restart is segfault'ing when it errors
>>>> out. The error message is being generated between the launch of the
>>>> opal-restart starter command and when we try to exec(cr_restart).
>>>> Usually the failure is related to a corruption of the metadata
>>>> stored in the checkpoint.
>>>>
>>>> Can you send me the file below:
>>>> ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
>>>>
>>>>
>>>> I was able to reproduce the segv (at least I think it is the same
>>>> one). We failed to check the validity of a string when we parse the
>>>> metadata. I committed a fix to the trunk in r22290, and requested
>>>> that the fix be moved to the v1.4 and v1.5 branches. If you are
>>>> interested in seeing when they get applied, you can follow these
>>>> tickets:
>>>> https://svn.open-mpi.org/trac/ompi/ticket/2140
>>>> https://svn.open-mpi.org/trac/ompi/ticket/2141
>>>>
>>>> Can you try the trunk to see if the problem goes away? The
>>>> development trunk and v1.5 series have a bunch of improvements to
>>>> the C/R functionality that were never brought over to the v1.3/v1.4
>>>> series.
>>>>
>>>>>
>>>>> I was trying to use ssh instead of rsh, but it was impossible. By
>>>>> default it should use ssh, and if it finds a problem, it will use
>>>>> rsh. It seems that ssh never works, because it always uses rsh.
>>>>> If I change this MCA parameter, it still uses rsh.
>>>>> If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to
>>>>> use ssh and doesn't work. I get --> "bash: orted: command not
>>>>> found" and the MPI process dies.
>>>>> The command it tries to execute is the following, and I haven't yet
>>>>> found out why it can't find orted, because I set the PATH in
>>>>> /etc/bashrc in order to always get the right path, and I have the
>>>>> right path in my application. (see ERROR4)
>>>>
>>>> This seems like an SGE specific issue, so a bit out of my domain.
>>>> Maybe others have suggestions here.
>>>>
>>>> -- Josh
>>>>>
>>>>>
>>>>> Many thanks!,
>>>>> Sergio
>>>>>
>>>>> P.S. Sorry about these long emails. I just try to show you useful
>>>>> information to identify my problems.
>>>>>
>>>>>
>>>>> ERROR 1
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart
>>>>> ompi_global_snapshot_28454.ckpt
>>>>> >
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> > Error: Unable to obtain the proper restart command to restart
>>>>> from the
>>>>> > checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>>>>> >
>>>>> >
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> >
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> > Error: Unable to obtain the proper restart command to restart
>>>>> from the
>>>>> > checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>>>> >
>>>>> >
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> > [compute-3-18:28792] *** Process received signal ***
>>>>> > [compute-3-18:28792] Signal: Segmentation fault (11)
>>>>> > [compute-3-18:28792] Signal code: (128)
>>>>> > [compute-3-18:28792] Failing at address: (nil)
>>>>> > [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
>>>>> > [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
>>>>> [0x33bb669135]
>>>>> > [compute-3-18:28792] [ 2]
>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e)
>>>>> [0x2a95586658]
>>>>> > [compute-3-18:28792] [ 3]
>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e)
>>>>> [0x2a9557906e]
>>>>> > [compute-3-18:28792] [ 4]
>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36)
>>>>> [0x2a9556bcfa]
>>>>> > [compute-3-18:28792] [ 5] opal-restart [0x40312a]
>>>>> > [compute-3-18:28792] [ 6]
>>>>> /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>> > [compute-3-18:28792] [ 7] opal-restart [0x40272a]
>>>>> > [compute-3-18:28792] *** End of error message ***
>>>>> > [compute-3-18:28793] *** Process received signal ***
>>>>> > [compute-3-18:28793] Signal: Segmentation fault (11)
>>>>> > [compute-3-18:28793] Signal code: (128)
>>>>> > [compute-3-18:28793] Failing at address: (nil)
>>>>> > [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
>>>>> > [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
>>>>> [0x33bb669135]
>>>>> > [compute-3-18:28793] [ 2]
>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e)
>>>>> [0x2a95586658]
>>>>> > [compute-3-18:28793] [ 3]
>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e)
>>>>> [0x2a9557906e]
>>>>> > [compute-3-18:28793] [ 4]
>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36)
>>>>> [0x2a9556bcfa]
>>>>> > [compute-3-18:28793] [ 5] opal-restart [0x40312a]
>>>>> > [compute-3-18:28793] [ 6]
>>>>> /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>> > [compute-3-18:28793] [ 7] opal-restart [0x40272a]
>>>>> > [compute-3-18:28793] *** End of error message ***
>>>>> >
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> > mpirun noticed that process rank 0 with PID 28792 on node
>>>>> compute-3-18.local exited on signal 11 (Segmentation fault).
>>>>> >
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ERROR 2
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart -v
>>>>> ompi_global_snapshot_28454.ckpt
>>>>> >[compute-3-18.local:28941] Checking for the existence of
>>>>> (/home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
>>>>> > [compute-3-18.local:28941] Restarting from file
>>>>> (ompi_global_snapshot_28454.ckpt)
>>>>> > [compute-3-18.local:28941] Exec in self
>>>>> > .......
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ERROR3
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>> >[sdiaz_at_compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
>>>>> > How many plm_rsh_agent instances to invoke concurrently
>>>>> (must be > 0)
>>>>> > MCA plm: parameter "plm_rsh_agent" (current value: "ssh
>>>>> : rsh", data source: default value, synonyms: pls_rsh_agent)
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>>
>>>>> ERROR4
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>
>>>>> >/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -mca
>>>>> ess env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid 1 -mca
>>>>> orte_ess_num_procs 2 --hnp-uri
>>>>> >"2152464384.0;tcp://192.168.4.143:59176" -mca
>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>> mca_base_param_file_path
>>>>> >/opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
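One plausible explanation for the "bash: orted: command not found" above: when ssh runs a command directly, the remote shell is non-interactive and does not read /etc/bashrc, so a PATH set there never reaches orted. A hedged job-script fragment (not run here; the install prefix is taken from the paths quoted in this thread, adjust to your site):

```shell
# Let mpirun export the install paths to the remote orted itself,
# instead of relying on the remote shell's startup files:
mpirun --prefix /opt/cesga/openmpi-1.3.3 -np 2 -am ft-enable-cr ./pi3

# Invoking mpirun by absolute path has the same effect in Open MPI:
/opt/cesga/openmpi-1.3.3/bin/mpirun -np 2 -am ft-enable-cr ./pi3
```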
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Josh Hursey wrote:
>>>>>>
>>>>>> On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Josh,
>>>>>>>
>>>>>>> The OpenMPI version is 1.3.3.
>>>>>>>
>>>>>>> The command ompi-ps doesn't work.
>>>>>>>
>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
>>>>>>> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and
>>>>>>> setting contact info into RML...
>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
>>>>>>> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and
>>>>>>> setting contact info into RML...
>>>>>>>
>>>>>>> [root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
>>>>>>> root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
>>>>>>> 0:00 \_ grep sdiaz
>>>>>>> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
>>>>>>> 0:00 \_ -bash
>>>>>>> /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
>>>>>>> sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
>>>>>>> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit
>>>>>>> -nostdin -V compute-3-17.local orted -mca ess env -mca
>>>>>>> orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
>>>>>>> orte_ess_num_procs 2 --hnp-uri
>>>>>>> "2769879040.0;tcp://192.168.4.143:57010" -mca
>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>> mca_base_param_file_path
>>>>>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>>> -mca mca_base_param_file_path_force
>>>>>>> /home_no_usc/cesga/sdiaz/mpi_test
>>>>>>> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
>>>>>>> 0:00 \_ ./pi3
>>>>>>>
>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n c3-18
>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n
>>>>>>>
>>>>>>> There is no such directory in /tmp on the node. However, if the
>>>>>>> application is run without SGE, the directory is created.
>>>>>>>
>>>>>> This may be the core of the problem. ompi-ps and other command
>>>>>> line tools (e.g., ompi-checkpoint) look for the Open MPI session
>>>>>> directory in /tmp in order to find the connection information to
>>>>>> connect to the mpirun process (internally called the HNP or Head
>>>>>> Node Process).
>>>>>>
>>>>>> Can you change the location of the temporary directory in SGE?
>>>>>> The temporary directory is usually set via an environment
>>>>>> variable (e.g., TMPDIR, or TMP). So removing the environment
>>>>>> variable or setting it to /tmp might help.
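A minimal job-script fragment along those lines (a sketch, not run here; it assumes the job script may simply override the variable, and it sidesteps SGE's per-job scratch clean-up):

```shell
# Force the Open MPI session directory back under /tmp so external
# tools (ompi-checkpoint, ompi-ps) can find and contact mpirun.
# Note: this bypasses SGE's scratch directory, so clean-up is on you.
export TMPDIR=/tmp
mpirun -np 2 -am ft-enable-cr ./pi3
```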
>>>>>>
>>>>>>
>>>>>>
>>>>>>> but if I do ompi-ps -j MPIRUN_PID, it seems to hang and I have to
>>>>>>> interrupt it. Does it take a long time?
>>>>>>>
>>>>>> It should not take a long time. It is just querying the mpirun
>>>>>> process for state information.
>>>>>>
>>>>>>
>>>>>>> what does the -j option of the ompi-ps command mean? It isn't
>>>>>>> related to a batch system (like SGE, Condor...), is it?
>>>>>>>
>>>>>> The '-j' option allows the user to specify the Open MPI jobid.
>>>>>> This is completely different than the jobid provided by the batch
>>>>>> system. In general, users should not need to specify the -j
>>>>>> option. It is useful when you have multiple Open MPI jobs, and
>>>>>> want a summary of just one of them.
>>>>>>
>>>>>>
>>>>>>> Thanks for the ticket. I will follow it.
>>>>>>>
>>>>>>> Talking with Alan, I realized that only a few transport
>>>>>>> protocols are supported, and maybe that is the problem.
>>>>>>> Currently, SGE is using qrsh to spawn the MPI processes. I can
>>>>>>> change this protocol and use ssh. So, I'm going to test that this
>>>>>>> afternoon and I will report the results to you.
>>>>>>>
>>>>>> Try 'ssh' and see if that helps. I suspect the problem is with
>>>>>> the session directory location though.
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Sergio
>>>>>>>
>>>>>>>
>>>>>>> Josh Hursey wrote:
>>>>>>>
>>>>>>>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have achieved a checkpoint of a simple program without SGE.
>>>>>>>>> Now I'm trying to do the Open MPI + SGE integration, but I have
>>>>>>>>> some problems... When I try to checkpoint the mpirun
>>>>>>>>> PID, I get an error similar to the one you get when the PID
>>>>>>>>> doesn't exist. See the example below.
>>>>>>>>>
>>>>>>>> I do not have any experience with the SGE environment, so I
>>>>>>>> suspect that there may be something 'special' about the
>>>>>>>> environment that is tripping up the ompi-checkpoint tool.
>>>>>>>>
>>>>>>>> First of all, what version of Open MPI are you using?
>>>>>>>>
>>>>>>>> Some things to check:
>>>>>>>> - Does 'ompi-ps' work when your application is running?
>>>>>>>> - Is there a /tmp/openmpi-sessions-* directory on the node
>>>>>>>> where mpirun is currently running? This directory contains
>>>>>>>> information on how to connect to the mpirun process from an
>>>>>>>> external tool; if it's missing, then this could be the cause of
>>>>>>>> the problem.
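The second check can be scripted; this snippet only assumes the session directory follows the /tmp/openmpi-sessions-<user>* naming mentioned above:

```shell
# Quick check, on the node where mpirun runs, that the session
# directory the external tools need is actually present under /tmp:
ls -d /tmp/openmpi-sessions-"$USER"* 2>/dev/null \
    || echo "no Open MPI session directory under /tmp"
```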
>>>>>>>>
>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> Does anybody have a script to do it automatically with SGE? For
>>>>>>>>> example, I have one that does a checkpoint every X seconds with
>>>>>>>>> BLCR for non-MPI jobs. It is launched by SGE if you have configured
>>>>>>>>> the queue and the checkpoint environment.
>>>>>>>>>
>>>>>>>> I do not know of any integration of the Open MPI checkpointing
>>>>>>>> work with SGE at the moment.
>>>>>>>>
>>>>>>>> As far as time-triggered checkpointing goes, I have a feature
>>>>>>>> ticket open about this:
>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1961
>>>>>>>>
>>>>>>>> It is not available yet, but in the works.
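Until that ticket lands, a driver loop can approximate time-triggered checkpointing from a script. This is only a sketch (the interval, count, and stop condition are illustrative, and error handling is minimal):

```shell
#!/bin/bash
# Sketch of time-triggered checkpointing (what ticket 1961 would
# automate): checkpoint the given mpirun PID every INTERVAL seconds,
# at most COUNT times, stopping once checkpointing fails (e.g. when
# mpirun has exited).
periodic_checkpoint() {
    local pid="$1" interval="$2" count="$3" i
    for ((i = 0; i < count; i++)); do
        sleep "$interval"
        ompi-checkpoint "$pid" || break   # mpirun gone or ckpt failed
    done
}

# Usage from an SGE checkpoint method or job script, e.g.:
#   periodic_checkpoint "$MPIRUN_PID" 600 100000
```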
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Is it possible to choose the name of the checkpoint folder when
>>>>>>>>> you do the ompi-checkpoint? I can't find an option to do it.
>>>>>>>>>
>>>>>>>> Not at this time. Though I could see it as a useful feature,
>>>>>>>> and it shouldn't be too hard to implement. I filed a ticket if you
>>>>>>>> want to follow the progress:
>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2098
>>>>>>>>
>>>>>>>> -- Josh
>>>>>>>>> Regards,
>>>>>>>>> Sergio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------
>>>>>>>>>
>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>>>>>>>>> ....
>>>>>>>>> root 20044 0.0 0.0 4468 1224 ? S 13:28
>>>>>>>>> 0:00 \_ sge_shepherd-2645150 -bg
>>>>>>>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
>>>>>>>>> 0:00 \_ -bash
>>>>>>>>> /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
>>>>>>>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>>>>>>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh
>>>>>>>>> -inherit -nostdin -V compute-3-18..........
>>>>>>>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>>>>>>>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>>>>>>>
>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>>>>>>>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>>>>>>>
>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>>>>>>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>>>>>>>
>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> ompi-checkpoint PID_OF_MPIRUN
>>>>>>>>> Open MPI Checkpoint Tool
>>>>>>>>>
>>>>>>>>> -am <arg0> Aggregate MCA parameter set file list
>>>>>>>>> -gmca|--gmca <arg0> <arg1>
>>>>>>>>> Pass global MCA parameters that are
>>>>>>>>> applicable to
>>>>>>>>> all contexts (arg0 is the parameter
>>>>>>>>> name; arg1 is
>>>>>>>>> the parameter value)
>>>>>>>>> -h|--help This help message
>>>>>>>>> --hnp-jobid <arg0> This should be the jobid of the HNP
>>>>>>>>> whose
>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>> --hnp-pid <arg0> This should be the pid of the mpirun
>>>>>>>>> whose
>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>> -mca|--mca <arg0> <arg1>
>>>>>>>>> Pass context-specific MCA parameters;
>>>>>>>>> they are
>>>>>>>>> considered global if --gmca is not
>>>>>>>>> used and only
>>>>>>>>> one context is specified (arg0 is the
>>>>>>>>> parameter
>>>>>>>>> name; arg1 is the parameter value)
>>>>>>>>> -s|--status Display status messages describing
>>>>>>>>> the progression
>>>>>>>>> of the checkpoint
>>>>>>>>> --term Terminate the application after
>>>>>>>>> checkpoint
>>>>>>>>> -v|--verbose Be Verbose
>>>>>>>>> -w|--nowait Do not wait for the application to
>>>>>>>>> finish
>>>>>>>>> checkpointing before returning
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ exit
>>>>>>>>> logout
>>>>>>>>> Connection to c3-17 closed.
>>>>>>>>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>>>>>>>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>>>>>>>> -bash-3.00$ ps auxf |grep sdiaz
>>>>>>>>>
>>>>>>>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
>>>>>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
>>>>>>>>> /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>>>>>>>>>
>>>>>>>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>>>>>>>>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid
>>>>>>>>> 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>>>>>>> --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
>>>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>>>> mca_base_param_file_path
>>>>>>>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>>> -mca mca_base_param_file_path_force
>>>>>>>>> /home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sergio Díaz Montes
>>>>>>>>> Centro de Supercomputacion de Galicia
>>>>>>>>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela
>>>>>>>>> (Spain)
>>>>>>>>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
>>>>>>>>> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>>>>>>>>> ------------------------------------------------
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>

-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_[hidden] ; http://www.cesga.es/
------------------------------------------------


