
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Sergio Díaz (sdiaz_at_[hidden])
Date: 2009-12-14 11:05:46


Hi Josh,

I got a successful checkpoint with a fresh installation and without using
the trunk. I can't understand why it is working now, when before I could
not get a successful restart... Maybe there was something wrong in the
Open MPI installation and the metadata was created the wrong way.
I will test it more, and I will also test the trunk.

Regards,
Sergio

[sdiaz_at_compute-3-13 ~]$ ompi-restart -machinefile
mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
 tiempo 110
 Process 1 :
 compute-3-14.local
                     of 2
 tiempo 110
 Process 0 :
 compute-3-13.local
                     of 2
 tiempo 120
 Process 1 :
 compute-3-14.local
                     of 2
 tiempo 120
 Process 0 :
 compute-3-13.local
...
...
                     
[sdiaz_at_compute-3-14 ~]$ ps auxf |grep sdiaz
sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted
--daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca
orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix
ft-enable-cr -mca mca_base_param_file_path
/opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz
-mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
sdiaz 26274 0.1 0.0 15984 504 ? Sl 15:58 0:00 \_
cr_restart
/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00 \_ ./pi3

[sdiaz_at_compute-3-13 ~]$ ps auxf |grep sdiaz
root 12878 0.0 0.0 90260 3000 pts/0 S 15:55 0:00 |
\_ su - sdiaz
sdiaz 12880 0.0 0.0 53432 1512 pts/0 S 15:55 0:00
| \_ -bash
sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00
| \_ mpirun -am ft-enable-cr --default-hostfile
mpi_test/lanzar_pi3.sh.po3117822 --app
/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/restart-appfile
sdiaz 13073 0.0 0.0 15988 508 pts/0 Sl+ 15:58 0:00
| \_ cr_restart
/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.12558
sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00
| \_ ./pi3
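(For reference, the cycle behind the transcripts above is roughly the
following sketch; the mpirun PID, snapshot name, and hostfile are
placeholders taken from this run:)

   # on the node where mpirun (the HNP) is running
   ompi-checkpoint --term <PID_OF_MPIRUN>   # snapshot lands in $HOME as ompi_global_snapshot_<PID>.ckpt
   # later, on a node of a matching allocation
   ompi-restart -machinefile <hostfile> ompi_global_snapshot_<PID>.ckpt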

Sergio Díaz wrote:
> Hi Josh
>
> Here you go the file.
>
> I will try the trunk, but I think I broke my Open MPI installation by
> doing "something", and I don't know what :-( . I was modifying the MCA
> parameters...
> When I submit a job, the orted daemon spawned on the SLAVE host is
> launched in a loop until it consumes all the reserved memory.
> It is very strange, so I will compile it again, reproduce the bug, and
> then test the trunk.
>
> Thanks a lot for the support and tickets opened.
> Sergio
>
>
> sdiaz 30279 0.0 0.0 1888 560 ? Ds 12:54 0:00 \_
> /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
> /opt/cesga/sge62/default/spool/compute
> sdiaz 30286 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca ess
> env -mca orte_ess_jobid 219
> sdiaz 30322 0.0 0.0 52772 1188 ? S 12:54
> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30358 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30394 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30430 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash
> /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30466 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash
> /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30502 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash
> /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30538 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash
> /opt/cesga/openmpi-1.3.3/bin/orted
> sdiaz 30574 0.0 0.0 52772 1188 ? D 12:54
> 0:00 \_ /bin/bash
> /opt/cesga/openmpi-1.3.3/bin/orted
> ....
>
>
>
> Josh Hursey wrote:
>>
>> On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
>>
>>> Hi Josh,
>>>
>>> You were right. The main problem was /tmp. SGE uses a scratch
>>> directory in which the jobs keep their temporary files. After setting
>>> TMPDIR to /tmp, checkpoint works!
>>> However, when I try to restart, I get the following error (see
>>> ERROR1). The -v option adds these lines (see ERROR2).
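>>>
>>> (A minimal sketch of what that looks like in the SGE job script, with
>>> placeholder PE and slot settings; only the TMPDIR line is the point:)
>>>
>>>   #!/bin/bash
>>>   #$ -pe mpi 2
>>>   export TMPDIR=/tmp                  # keep the Open MPI session directory under /tmp
>>>   mpirun -np $NSLOTS -am ft-enable-cr ./pi3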
>>
>> It is concerning that ompi-restart is segfault'ing when it errors
>> out. The error message is being generated between the launch of the
>> opal-restart starter command and when we try to exec(cr_restart).
>> Usually the failure is related to a corruption of the metadata stored
>> in the checkpoint.
>>
>> Can you send me the file below:
>> ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
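>>
>> (The file is plain text, so something like the following should show
>> its contents; the path assumes the same snapshot as above:)
>>
>>   cat ~/ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data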
>>
>>
>> I was able to reproduce the segv (at least I think it is the same
>> one). We failed to check the validity of a string when we parse the
>> metadata. I committed a fix to the trunk in r22290, and requested
>> that the fix be moved to the v1.4 and v1.5 branches. If you are
>> interested in seeing when they get applied, you can follow these
>> tickets:
>> https://svn.open-mpi.org/trac/ompi/ticket/2140
>> https://svn.open-mpi.org/trac/ompi/ticket/2141
>>
>> Can you try the trunk to see if the problem goes away? The
>> development trunk and v1.5 series have a bunch of improvements to the
>> C/R functionality that were never brought over to the v1.3/v1.4 series.
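>>
>> (If it helps, a rough sketch of building a C/R-enabled trunk install;
>> the SVN URL, BLCR path, and prefix are assumptions to adapt locally:)
>>
>>   svn co http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
>>   cd ompi-trunk                       # run the repository's autogen script first, as for any SVN checkout
>>   ./configure --prefix=$HOME/ompi-trunk --with-ft=cr --with-blcr=/path/to/blcr
>>   make -j4 all install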
>>
>>>
>>> I was trying to use ssh instead of rsh, but it was impossible. By
>>> default it should use ssh and, if it finds a problem, fall back to
>>> rsh. It seems that ssh doesn't work, because it always uses rsh.
>>> If I change this MCA parameter, it still uses rsh.
>>> If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to
>>> use ssh but doesn't work. I get --> "bash: orted: command not found"
>>> and the MPI process dies.
>>> The command it tries to execute is shown in ERROR4, and I haven't yet
>>> found the reason why it cannot find orted, because I set /etc/bashrc
>>> so that the right PATH is always used, and I have the right PATH in
>>> my application.
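>>>
>>> (A hedged workaround sketch for the "orted: command not found" case,
>>> using mpirun's --prefix option so the remote login PATH does not
>>> matter; the install path is the one shown in ERROR4:)
>>>
>>>   export OMPI_MCA_plm_rsh_disable_qrsh=1
>>>   export OMPI_MCA_plm_rsh_agent=ssh
>>>   mpirun --prefix /opt/cesga/openmpi-1.3.3 -np 2 -am ft-enable-cr ./pi3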
>>
>> This seems like an SGE specific issue, so a bit out of my domain.
>> Maybe others have suggestions here.
>>
>> -- Josh
>>
>>>
>>>
>>> Many thanks!,
>>> Sergio
>>>
>>> P.S. Sorry about these long emails. I'm just trying to give you
>>> useful information to identify my problems.
>>>
>>>
>>> ERROR 1
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
>>> >
>>> --------------------------------------------------------------------------
>>>
>>> > Error: Unable to obtain the proper restart command to restart from
>>> the
>>> > checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>>> >
>>> >
>>> --------------------------------------------------------------------------
>>>
>>> >
>>> --------------------------------------------------------------------------
>>>
>>> > Error: Unable to obtain the proper restart command to restart from
>>> the
>>> > checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>> >
>>> >
>>> --------------------------------------------------------------------------
>>>
>>> > [compute-3-18:28792] *** Process received signal ***
>>> > [compute-3-18:28792] Signal: Segmentation fault (11)
>>> > [compute-3-18:28792] Signal code: (128)
>>> > [compute-3-18:28792] Failing at address: (nil)
>>> > [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
>>> > [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
>>> [0x33bb669135]
>>> > [compute-3-18:28792] [ 2]
>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e)
>>> [0x2a95586658]
>>> > [compute-3-18:28792] [ 3]
>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e)
>>> [0x2a9557906e]
>>> > [compute-3-18:28792] [ 4]
>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36)
>>> [0x2a9556bcfa]
>>> > [compute-3-18:28792] [ 5] opal-restart [0x40312a]
>>> > [compute-3-18:28792] [ 6]
>>> /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>> > [compute-3-18:28792] [ 7] opal-restart [0x40272a]
>>> > [compute-3-18:28792] *** End of error message ***
>>> > [compute-3-18:28793] *** Process received signal ***
>>> > [compute-3-18:28793] Signal: Segmentation fault (11)
>>> > [compute-3-18:28793] Signal code: (128)
>>> > [compute-3-18:28793] Failing at address: (nil)
>>> > [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
>>> > [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
>>> [0x33bb669135]
>>> > [compute-3-18:28793] [ 2]
>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e)
>>> [0x2a95586658]
>>> > [compute-3-18:28793] [ 3]
>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e)
>>> [0x2a9557906e]
>>> > [compute-3-18:28793] [ 4]
>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36)
>>> [0x2a9556bcfa]
>>> > [compute-3-18:28793] [ 5] opal-restart [0x40312a]
>>> > [compute-3-18:28793] [ 6]
>>> /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>> > [compute-3-18:28793] [ 7] opal-restart [0x40272a]
>>> > [compute-3-18:28793] *** End of error message ***
>>> >
>>> --------------------------------------------------------------------------
>>>
>>> > mpirun noticed that process rank 0 with PID 28792 on node
>>> compute-3-18.local exited on signal 11 (Segmentation fault).
>>> >
>>> --------------------------------------------------------------------------
>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>>
>>>
>>> ERROR 2
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart -v
>>> ompi_global_snapshot_28454.ckpt
>>> >[compute-3-18.local:28941] Checking for the existence of
>>> (/home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
>>> > [compute-3-18.local:28941] Restarting from file
>>> (ompi_global_snapshot_28454.ckpt)
>>> > [compute-3-18.local:28941] Exec in self
>>> > .......
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>>
>>>
>>> ERROR3
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>> >[sdiaz_at_compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
>>> > How many plm_rsh_agent instances to invoke concurrently
>>> (must be > 0)
>>> > MCA plm: parameter "plm_rsh_agent" (current value: "ssh :
>>> rsh", data source: default value, synonyms: pls_rsh_agent)
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>>
>>> ERROR4
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>> >/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -mca ess
>>> env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid 1 -mca
>>> orte_ess_num_procs 2 --hnp-uri
>>> >"2152464384.0;tcp://192.168.4.143:59176" -mca
>>> mca_base_param_file_prefix ft-enable-cr -mca
>>> mca_base_param_file_path
>>> >/opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Josh Hursey wrote:
>>>>
>>>> On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
>>>>
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> The OpenMPI version is 1.3.3.
>>>>>
>>>>> The command ompi-ps doesn't work.
>>>>>
>>>>> [root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
>>>>> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and
>>>>> setting contact info into RML...
>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
>>>>> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and
>>>>> setting contact info into RML...
>>>>>
>>>>> [root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
>>>>> root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
>>>>> 0:00 \_ grep sdiaz
>>>>> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
>>>>> 0:00 \_ -bash
>>>>> /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
>>>>> sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
>>>>> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit
>>>>> -nostdin -V compute-3-17.local orted -mca ess env -mca
>>>>> orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
>>>>> orte_ess_num_procs 2 --hnp-uri
>>>>> "2769879040.0;tcp://192.168.4.143:57010" -mca
>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>> mca_base_param_file_path
>>>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>>>>> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
>>>>> 0:00 \_ ./pi3
>>>>>
>>>>> [root_at_compute-3-18 ~]# ompi-ps -n c3-18
>>>>> [root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
>>>>> [root_at_compute-3-18 ~]# ompi-ps -n
>>>>>
>>>>> There is no such directory in /tmp on the node. However, if the
>>>>> application is run without SGE, the directory is created.
>>>>>
>>>> This may be the core of the problem. ompi-ps and other command line
>>>> tools (e.g., ompi-checkpoint) look for the Open MPI session
>>>> directory in /tmp in order to find the connection information to
>>>> connect to the mpirun process (internally called the HNP or Head
>>>> Node Process).
>>>>
>>>> Can you change the location of the temporary directory in SGE? The
>>>> temporary directory is usually set via an environment variable
>>>> (e.g., TMPDIR, or TMP). So removing the environment variable or
>>>> setting it to /tmp might help.
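>>>>
>>>> (A quick way to check both things on the node running mpirun; the
>>>> paths are the defaults mentioned above:)
>>>>
>>>>   echo $TMPDIR                     # SGE typically points this at its per-job scratch directory
>>>>   ls -d /tmp/openmpi-sessions-*    # should exist if the session directory is under /tmp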
>>>>
>>>>
>>>>
>>>>> but if I do ompi-ps -j MPIRUN_PID, it seems to hang and I have to
>>>>> interrupt it. Does it take a long time?
>>>>>
>>>> It should not take a long time. It is just querying the mpirun
>>>> process for state information.
>>>>
>>>>
>>>>> What does the -j option of the ompi-ps command mean? It isn't
>>>>> related to a batch system (like SGE, Condor...), is it?
>>>>>
>>>> The '-j' option allows the user to specify the Open MPI jobid. This
>>>> is completely different than the jobid provided by the batch
>>>> system. In general, users should not need to specify the -j option.
>>>> It is useful when you have multiple Open MPI jobs, and want a
>>>> summary of just one of them.
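>>>>
>>>> (For instance, something like the following; the jobid the tool
>>>> expects is Open MPI's own, not SGE's:)
>>>>
>>>>   ompi-ps                          # summary of all Open MPI jobs visible from this node
>>>>   ompi-ps -j <ompi_jobid>          # restrict the summary to one Open MPI job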
>>>>
>>>>
>>>>> Thanks for the ticket. I will follow it.
>>>>>
>>>>> Talking with Alan, I realized that only a few transport protocols
>>>>> are supported, and maybe that is the problem.
>>>>> Currently, SGE is using qrsh to launch the MPI processes. I can
>>>>> change this protocol and use ssh, so I'm going to test it this
>>>>> afternoon and I will let you know the results.
>>>>>
>>>> Try 'ssh' and see if that helps. I suspect the problem is with the
>>>> session directory location though.
>>>>
>>>>
>>>>> Regards,
>>>>> Sergio
>>>>>
>>>>>
>>>>> Josh Hursey wrote:
>>>>>
>>>>>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>>>>>>
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have managed to checkpoint a simple program without SGE.
>>>>>>> Now I'm trying to do the Open MPI + SGE integration, but I have
>>>>>>> some problems... When I try to checkpoint the mpirun PID,
>>>>>>> I get an error similar to the one I get when the PID doesn't
>>>>>>> exist. See the example below.
>>>>>>>
>>>>>> I do not have any experience with the SGE environment, so I
>>>>>> suspect that there may be something 'special' about the environment
>>>>>> that is tripping up the ompi-checkpoint tool.
>>>>>>
>>>>>> First of all, what version of Open MPI are you using?
>>>>>>
>>>>>> Some things to check:
>>>>>> - Does 'ompi-ps' work when your application is running?
>>>>>> - Is there a /tmp/openmpi-sessions-* directory on the node
>>>>>> where mpirun is currently running? This directory contains
>>>>>> information on how to connect to the mpirun process from an
>>>>>> external tool; if it's missing, this could be the cause of
>>>>>> the problem.
>>>>>>
>>>>>>
>>>>>>> Any ideas?
>>>>>>> Does somebody have a script to do it automatically with SGE? For
>>>>>>> example, I have one that checkpoints every X seconds with BLCR for
>>>>>>> non-MPI jobs. It is launched by SGE if you have configured the
>>>>>>> queue and the checkpoint environment.
>>>>>>>
>>>>>> I do not know of any integration of the Open MPI checkpointing
>>>>>> work with SGE at the moment.
>>>>>>
>>>>>> As far as time triggered checkpointing, I have a feature ticket
>>>>>> open about this:
>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1961
>>>>>>
>>>>>> It is not available yet, but in the works.
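>>>>>>
>>>>>> (In the meantime, a rough sketch of a time-triggered wrapper in the
>>>>>> spirit of the BLCR script mentioned above; the PID and interval are
>>>>>> placeholders and this is not an official tool:)
>>>>>>
>>>>>>   MPIRUN_PID=$1; INTERVAL=$2
>>>>>>   while kill -0 "$MPIRUN_PID" 2>/dev/null; do
>>>>>>       sleep "$INTERVAL"
>>>>>>       ompi-checkpoint "$MPIRUN_PID"   # records another checkpoint of the running job
>>>>>>   done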
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Is it possible to choose the name of the checkpoint folder when
>>>>>>> you run ompi-checkpoint? I can't find an option to do it.
>>>>>>>
>>>>>> Not at this time. Though I could see it being a useful feature,
>>>>>> and it shouldn't be too hard to implement. I filed a ticket if you
>>>>>> want to follow the progress:
>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2098
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>>> Regards,
>>>>>>> Sergio
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------
>>>>>>>
>>>>>>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>>>>>>> ....
>>>>>>> root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00
>>>>>>> \_ sge_shepherd-2645150 -bg
>>>>>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
>>>>>>> 0:00 \_ -bash
>>>>>>> /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
>>>>>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>>>>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit
>>>>>>> -nostdin -V compute-3-18..........
>>>>>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>>>>>>> 0:00 \_ pi3
>>>>>>>
>>>>>>>
>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>>>>>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>>>>>
>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>>>>>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>>>>>
>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>>>>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>>>>>
>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> ompi-checkpoint PID_OF_MPIRUN
>>>>>>> Open MPI Checkpoint Tool
>>>>>>>
>>>>>>> -am <arg0> Aggregate MCA parameter set file list
>>>>>>> -gmca|--gmca <arg0> <arg1>
>>>>>>> Pass global MCA parameters that are
>>>>>>> applicable to
>>>>>>> all contexts (arg0 is the parameter
>>>>>>> name; arg1 is
>>>>>>> the parameter value)
>>>>>>> -h|--help This help message
>>>>>>> --hnp-jobid <arg0> This should be the jobid of the HNP whose
>>>>>>> applications you wish to checkpoint.
>>>>>>> --hnp-pid <arg0> This should be the pid of the mpirun whose
>>>>>>> applications you wish to checkpoint.
>>>>>>> -mca|--mca <arg0> <arg1>
>>>>>>> Pass context-specific MCA parameters;
>>>>>>> they are
>>>>>>> considered global if --gmca is not used
>>>>>>> and only
>>>>>>> one context is specified (arg0 is the
>>>>>>> parameter
>>>>>>> name; arg1 is the parameter value)
>>>>>>> -s|--status Display status messages describing the
>>>>>>> progression
>>>>>>> of the checkpoint
>>>>>>> --term Terminate the application after checkpoint
>>>>>>> -v|--verbose Be Verbose
>>>>>>> -w|--nowait Do not wait for the application to finish
>>>>>>> checkpointing before returning
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> [sdiaz_at_compute-3-17 ~]$ exit
>>>>>>> logout
>>>>>>> Connection to c3-17 closed.
>>>>>>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>>>>>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>>>>>> -bash-3.00$ ps auxf |grep sdiaz
>>>>>>>
>>>>>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
>>>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
>>>>>>> /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>>>>>>>
>>>>>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>>>>>>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid
>>>>>>> 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>>>>> --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>> mca_base_param_file_path
>>>>>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>>> -mca mca_base_param_file_path_force
>>>>>>> /home_no_usc/cesga/sdiaz/mpi_test
>>>>>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>>>>>>> 0:00 \_ pi3
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_[hidden] ; http://www.cesga.es/
------------------------------------------------


