Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Reuti (reuti_at_[hidden])
Date: 2009-12-14 11:22:53


Hi,

On 14.12.2009, at 17:05, Sergio Díaz wrote:

> I got a successful checkpoint with a fresh installation and without
> using the trunk. I can't understand why it is working now, when
> before I couldn't get a successful restart... Maybe there was
> something wrong in the Open MPI installation and the metadata was
> therefore created incorrectly.
> I will test it further and I will also try the trunk.
>
> Regards,
> Sergio
>
> [sdiaz_at_compute-3-13 ~]$ ompi-restart -machinefile mpi_test/
> lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
> tiempo 110
> Process 1 :
> compute-3-14.local
> of 2
> tiempo 110
> Process 0 :
> compute-3-13.local
> of 2
> tiempo 120
> Process 1 :
> compute-3-14.local
> of 2
> tiempo 120
> Process 0 :
> compute-3-13.local
> ...
> ...
>
> [sdiaz_at_compute-3-14 ~]$ ps auxf |grep sdiaz
> sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00
> orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca

in a Tight Integration with SGE the daemon should get the argument
--no-daemonize. Are you restarting a job on the command line which
previously ran under SGE's supervision?
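
A quick way to tell the two cases apart is the process tree on the slave node (just ps, as already used elsewhere in this thread):

    ps auxf | grep sdiaz

Under a Tight Integration the orted should appear as a child of SGE's qrsh_starter (and without --daemonize), as in the listings further down in this thread; a stand-alone "orted --daemonize ..." like the one above means it was started outside of SGE's control.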

-- Reuti

> orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
> 1739128832.0;tcp://192.168.4.148:45551 -mca
> mca_base_param_file_prefix ft-enable-cr -mca
> mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/
> amca-param-sets:/home_no_usc/cesga/sdiaz -mca
> mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
> sdiaz 26274 0.1 0.0 15984 504 ? Sl 15:58 0:00 \_
> cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/
> opal_snapshot_1.ckpt/ompi_blcr_context.26047
> sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00
> \_ ./pi3
>
> [sdiaz_at_compute-3-13 ~]$ ps auxf |grep sdiaz
> root 12878 0.0 0.0 90260 3000 pts/0 S 15:55 0:00
> | \_ su - sdiaz
> sdiaz 12880 0.0 0.0 53432 1512 pts/0 S 15:55 0:00
> | \_ -bash
> sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00
> | \_ mpirun -am ft-enable-cr --default-hostfile
> mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/
> ompi_global_snapshot_12554.ckpt/restart-appfile
> sdiaz 13073 0.0 0.0 15988 508 pts/0 Sl+ 15:58 0:00
> | \_ cr_restart /home/cesga/sdiaz/
> ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/
> ompi_blcr_context.12558
> sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00
> | \_ ./pi3
>
>
> Sergio Díaz wrote:
>>
>> Hi Josh
>>
>> Here is the file.
>>
>> I will try the trunk, but I think that I broke my Open MPI
>> installation doing "something" and I don't know what :-( .
>> I was modifying the MCA parameters...
>> When I submit a job, the orted daemon spawned on the SLAVE host is
>> launched in a loop until all the reserved memory is used up.
>> It is very strange, so I will compile it again, reproduce the bug
>> and then test the trunk.
>>
>> Thanks a lot for the support and for the tickets you opened.
>> Sergio
>>
>>
>> sdiaz 30279 0.0 0.0 1888 560 ? Ds 12:54
>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/
>> cesga/sge62/default/spool/compute
>> sdiaz 30286 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca
>> ess env -mca orte_ess_jobid 219
>> sdiaz 30322 0.0 0.0 52772 1188 ? S 12:54
>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>> sdiaz 30358 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>> sdiaz 30394 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/
>> bin/orted
>> sdiaz 30430 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/
>> openmpi-1.3.3/bin/orted
>> sdiaz 30466 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/
>> openmpi-1.3.3/bin/orted
>> sdiaz 30502 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/
>> openmpi-1.3.3/bin/orted
>> sdiaz 30538 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/cesga/
>> openmpi-1.3.3/bin/orted
>> sdiaz 30574 0.0 0.0 52772 1188 ? D 12:54
>> 0:00 \_ /bin/bash /opt/
>> cesga/openmpi-1.3.3/bin/orted
>> ....
>>
>>
>>
>> Josh Hursey wrote:
>>>
>>>
>>> On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> You were right. The main problem was /tmp. SGE uses a scratch
>>>> directory in which the jobs keep their temporary files.
>>>> After setting TMPDIR to /tmp, checkpointing works!
>>>> However, when I try to restart, I get the following error
>>>> (see ERROR1). The -v option adds these lines (see ERROR2).
>>>
>>> It is concerning that ompi-restart is segfaulting when it errors
>>> out. The error message is generated between the launch of the
>>> opal-restart starter command and the exec of the restart command
>>> (cr_restart). Usually the failure is related to corruption of the
>>> metadata stored in the checkpoint.
>>>
>>> Can you send me the file below:
>>> ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/
>>> snapshot_meta.data
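>>>
>>> (If it is easier, just paste its plain-text contents, e.g. the
>>> output of:
>>>
>>>   cat ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
>>>
>>> run from the directory holding the snapshot.)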
>>>
>>> I was able to reproduce the segv (at least I think it is the same
>>> one). We failed to check the validity of a string when we parse
>>> the metadata. I committed a fix to the trunk in r22290, and
>>> requested that the fix be moved to the v1.4 and v1.5 branches. If
>>> you are interested in seeing when they get applied you can follow
>>> the following tickets:
>>> https://svn.open-mpi.org/trac/ompi/ticket/2140
>>> https://svn.open-mpi.org/trac/ompi/ticket/2141
>>>
>>> Can you try the trunk to see if the problem goes away? The
>>> development trunk and v1.5 series have a bunch of improvements to
>>> the C/R functionality that were never brought over to the v1.3/
>>> v1.4 series.
>>>
>>>>
>>>> I was trying to use ssh instead of rsh, but it was impossible. By
>>>> default it should use ssh and fall back to rsh only if it finds a
>>>> problem, yet it seems ssh is never used because it always uses
>>>> rsh. If I change this MCA parameter, it still uses rsh.
>>>> If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries
>>>> to use ssh and doesn't work. I get "bash: orted: command not
>>>> found" and the MPI process dies.
>>>> The command it tries to execute is the following, and I haven't
>>>> yet found why it can't find orted, because I set /etc/bashrc so
>>>> that the right path is always set, and the path is correct inside
>>>> my application (see ERROR4).
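>>>>
>>>> To narrow it down I will also check which PATH a non-interactive
>>>> ssh shell really gets, and try mpirun's --prefix option so that
>>>> orted is found without relying on /etc/bashrc (just an idea; the
>>>> host and install path are the ones from this thread):
>>>>
>>>>   ssh -x compute-3-17.local 'echo $PATH; which orted'
>>>>   mpirun --prefix /opt/cesga/openmpi-1.3.3 -np 2 -am ft-enable-cr ./pi3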
>>>
>>> This seems like an SGE specific issue, so a bit out of my domain.
>>> Maybe others have suggestions here.
>>>
>>> -- Josh
>>>
>>>>
>>>>
>>>> Many thanks!,
>>>> Sergio
>>>>
>>>> P.S. Sorry about these long emails. I'm just trying to give you
>>>> useful information to identify my problems.
>>>>
>>>>
>>>> ERROR 1
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>
>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart
>>>> ompi_global_snapshot_28454.ckpt
>>>> >
>>>> -------------------------------------------------------------------
>>>> -------
>>>> > Error: Unable to obtain the proper restart command to restart
>>>> from the
>>>> > checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>>>> >
>>>> >
>>>> -------------------------------------------------------------------
>>>> -------
>>>> >
>>>> -------------------------------------------------------------------
>>>> -------
>>>> > Error: Unable to obtain the proper restart command to restart
>>>> from the
>>>> > checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>>> >
>>>> >
>>>> -------------------------------------------------------------------
>>>> -------
>>>> > [compute-3-18:28792] *** Process received signal ***
>>>> > [compute-3-18:28792] Signal: Segmentation fault (11)
>>>> > [compute-3-18:28792] Signal code: (128)
>>>> > [compute-3-18:28792] Failing at address: (nil)
>>>> > [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0
>>>> [0x33bbf0c430]
>>>> > [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free
>>>> +0x25) [0x33bb669135]
>>>> > [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-
>>>> pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
>>>> > [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-
>>>> pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
>>>> > [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-
>>>> pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
>>>> > [compute-3-18:28792] [ 5] opal-restart [0x40312a]
>>>> > [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6
>>>> (__libc_start_main+0xdb) [0x33bb61c3fb]
>>>> > [compute-3-18:28792] [ 7] opal-restart [0x40272a]
>>>> > [compute-3-18:28792] *** End of error message ***
>>>> > [compute-3-18:28793] *** Process received signal ***
>>>> > [compute-3-18:28793] Signal: Segmentation fault (11)
>>>> > [compute-3-18:28793] Signal code: (128)
>>>> > [compute-3-18:28793] Failing at address: (nil)
>>>> > [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0
>>>> [0x33bbf0c430]
>>>> > [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free
>>>> +0x25) [0x33bb669135]
>>>> > [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-
>>>> pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
>>>> > [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-
>>>> pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
>>>> > [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-
>>>> pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
>>>> > [compute-3-18:28793] [ 5] opal-restart [0x40312a]
>>>> > [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6
>>>> (__libc_start_main+0xdb) [0x33bb61c3fb]
>>>> > [compute-3-18:28793] [ 7] opal-restart [0x40272a]
>>>> > [compute-3-18:28793] *** End of error message ***
>>>> >
>>>> -------------------------------------------------------------------
>>>> -------
>>>> > mpirun noticed that process rank 0 with PID 28792 on node
>>>> compute-3-18.local exited on signal 11 (Segmentation fault).
>>>> >
>>>> -------------------------------------------------------------------
>>>> -------
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>
>>>>
>>>>
>>>> ERROR 2
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart -v
>>>> ompi_global_snapshot_28454.ckpt
>>>> >[compute-3-18.local:28941] Checking for the existence of (/home/
>>>> cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
>>>> > [compute-3-18.local:28941] Restarting from file
>>>> (ompi_global_snapshot_28454.ckpt)
>>>> > [compute-3-18.local:28941] Exec in self
>>>> > .......
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>>
>>>>
>>>> ERROR3
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >[sdiaz_at_compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
>>>> > How many plm_rsh_agent instances to invoke
>>>> concurrently (must be > 0)
>>>> > MCA plm: parameter "plm_rsh_agent" (current value:
>>>> "ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>>
>>>> ERROR4
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -mca
>>>> ess env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid 1 -mca
>>>> orte_ess_num_procs 2 --hnp-uri >"2152464384.0;tcp://
>>>> 192.168.4.143:59176" -mca mca_base_param_file_prefix ft-enable-
>>>> cr -mca mca_base_param_file_path >/opt/cesga/openmpi-1.3.3/share/
>>>> openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca
>>>> mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Josh Hursey wrote:
>>>>>
>>>>> On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
>>>>>
>>>>>
>>>>>> Hi Josh,
>>>>>>
>>>>>> The OpenMPI version is 1.3.3.
>>>>>>
>>>>>> The command ompi-ps doesn't work.
>>>>>>
>>>>>> [root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
>>>>>> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and
>>>>>> setting contact info into RML...
>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
>>>>>> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and
>>>>>> setting contact info into RML...
>>>>>>
>>>>>> [root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
>>>>>> root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
>>>>>> 0:00 \_ grep sdiaz
>>>>>> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
>>>>>> 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-18/
>>>>>> job_scripts/2726959
>>>>>> sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
>>>>>> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
>>>>>> inherit -nostdin -V compute-3-17.local orted -mca ess env -
>>>>>> mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
>>>>>> orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://
>>>>>> 192.168.4.143:57010" -mca mca_base_param_file_prefix ft-enable-
>>>>>> cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/
>>>>>> share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/
>>>>>> mpi_test -mca mca_base_param_file_path_force /home_no_usc/
>>>>>> cesga/sdiaz/mpi_test
>>>>>> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
>>>>>> 0:00 \_ ./pi3
>>>>>>
>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n c3-18
>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n
>>>>>>
>>>>>> There is no such directory in /tmp on the node. However, if
>>>>>> the application is run without SGE, the directory is created.
>>>>> This may be the core of the problem. ompi-ps and other command
>>>>> line tools (e.g., ompi-checkpoint) look for the Open MPI
>>>>> session directory in /tmp in order to find the connection
>>>>> information to connect to the mpirun process (internally called
>>>>> the HNP or Head Node Process).
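>>>>>
>>>>> A quick way to verify this while the job is running is to look
>>>>> for the session directory on the node where mpirun lives (the
>>>>> exact name varies, so I am just globbing here):
>>>>>
>>>>>   ls -d /tmp/openmpi-sessions-*
>>>>>
>>>>> If nothing shows up there, the command line tools cannot find
>>>>> the mpirun process.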
>>>>>
>>>>> Can you change the location of the temporary directory in SGE?
>>>>> The temporary directory is usually set via an environment
>>>>> variable (e.g., TMPDIR, or TMP). So removing the environment
>>>>> variable or setting it to /tmp might help.
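>>>>>
>>>>> For example, something along these lines in the SGE job script
>>>>> might do it (untested sketch; the mpirun line is taken from your
>>>>> listing above):
>>>>>
>>>>>   #!/bin/bash
>>>>>   # put the Open MPI session directory back under /tmp instead
>>>>>   # of SGE's per-job scratch directory
>>>>>   export TMPDIR=/tmp
>>>>>   mpirun -np 2 -am ft-enable-cr ./pi3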
>>>>>
>>>>>
>>>>>
>>>>>> but if I do ompi-ps -j MPIRUN_PID, it seems to hang and I have
>>>>>> to interrupt it. Does it take a long time?
>>>>>>
>>>>> It should not take a long time. It is just querying the mpirun
>>>>> process for state information.
>>>>>
>>>>>
>>>>>> What does the -j option of the ompi-ps command mean? It isn't
>>>>>> related to a batch system (like SGE, Condor...), is it?
>>>>>>
>>>>> The '-j' option allows the user to specify the Open MPI jobid.
>>>>> This is completely different than the jobid provided by the
>>>>> batch system. In general, users should not need to specify the -
>>>>> j option. It is useful when you have multiple Open MPI jobs,
>>>>> and want a summary of just one of them.
>>>>>
>>>>>
>>>>>> Thanks for the ticket. I will follow it.
>>>>>>
>>>>>> Talking with Alan, I realized that only a few transport
>>>>>> protocols are supported, and maybe that is the problem.
>>>>>> Currently, SGE uses qrsh to spawn the MPI processes. I can
>>>>>> change this and use ssh instead, so I'm going to test it this
>>>>>> afternoon and let you know the results.
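>>>>>>
>>>>>> For the test I plan to bypass qrsh with something like this
>>>>>> (just a sketch; plm_rsh_disable_qrsh is the MCA parameter I
>>>>>> intend to use for it):
>>>>>>
>>>>>>   export OMPI_MCA_plm_rsh_disable_qrsh=1
>>>>>>   mpirun -np 2 -am ft-enable-cr ./pi3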
>>>>>>
>>>>> Try 'ssh' and see if that helps. I suspect the problem is with
>>>>> the session directory location though.
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Sergio
>>>>>>
>>>>>>
>>>>>> Josh Hursey wrote:
>>>>>>
>>>>>>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have managed to checkpoint a simple program without SGE.
>>>>>>>> Now I'm trying to do the Open MPI + SGE integration, but I
>>>>>>>> have some problems... When I try to checkpoint the mpirun
>>>>>>>> PID, I get an error similar to the one I get when the PID
>>>>>>>> doesn't exist. See the example below.
>>>>>>>>
>>>>>>> I do not have any experience with the SGE environment, so I
>>>>>>> suspect that there may be something 'special' about the
>>>>>>> environment that is tripping up the ompi-checkpoint tool.
>>>>>>>
>>>>>>> First of all, what version of Open MPI are you using?
>>>>>>>
>>>>>>> Somethings to check:
>>>>>>> - Does 'ompi-ps' work when your application is running?
>>>>>>> - Is there a /tmp/openmpi-sessions-* directory on the node
>>>>>>> where mpirun is currently running? This directory contains
>>>>>>> information on how to connect to the mpirun process from an
>>>>>>> external tool, if it's missing then this could be the cause
>>>>>>> of the problem.
>>>>>>>
>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>> Does somebody have a script to do it automatically with SGE?
>>>>>>>> For example, I have one that checkpoints non-MPI jobs every X
>>>>>>>> seconds with BLCR. It is launched by SGE if you have
>>>>>>>> configured the queue and the checkpoint environment.
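>>>>>>>>
>>>>>>>> Roughly, that script boils down to a loop like the one below,
>>>>>>>> which I would adapt to call ompi-checkpoint instead (untested
>>>>>>>> sketch; INTERVAL and PID are placeholders):
>>>>>>>>
>>>>>>>>   # checkpoint the mpirun with pid $PID every $INTERVAL seconds
>>>>>>>>   while kill -0 "$PID" 2>/dev/null; do
>>>>>>>>       sleep "$INTERVAL"
>>>>>>>>       ompi-checkpoint "$PID"
>>>>>>>>   done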
>>>>>>>>
>>>>>>> I do not know of any integration of the Open MPI
>>>>>>> checkpointing work with SGE at the moment.
>>>>>>>
>>>>>>> As for time-triggered checkpointing, I have a feature
>>>>>>> ticket open about this:
>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1961
>>>>>>>
>>>>>>> It is not available yet, but in the works.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Is it possible to choose the name of the checkpoint folder
>>>>>>>> when you run ompi-checkpoint? I can't find an option for it.
>>>>>>>>
>>>>>>> Not at this time, though I could see it as a useful feature,
>>>>>>> and it shouldn't be too hard to implement. I filed a ticket if
>>>>>>> you want to follow the progress:
>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2098
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sergio
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------
>>>>>>>>
>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>>>>>>>> ....
>>>>>>>> root 20044 0.0 0.0 4468 1224 ? S 13:28
>>>>>>>> 0:00 \_ sge_shepherd-2645150 -bg
>>>>>>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
>>>>>>>> 0:00 \_ -bash /opt/cesga/sge62/default/spool/
>>>>>>>> compute-3-17/job_scripts/2645150
>>>>>>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>>>>>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
>>>>>>>> inherit -nostdin -V compute-3-18..........
>>>>>>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>>>>>>>> 0:00 \_ pi3
>>>>>>>>
>>>>>>>>
>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>>>>>>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>>>>>>
>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>>>>>>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>>>>>>
>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>>>>>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>>>>>>
>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>>>>>>>> ---------------------------------------------------------------
>>>>>>>> -----------
>>>>>>>> ompi-checkpoint PID_OF_MPIRUN
>>>>>>>> Open MPI Checkpoint Tool
>>>>>>>>
>>>>>>>> -am <arg0> Aggregate MCA parameter set file list
>>>>>>>> -gmca|--gmca <arg0> <arg1>
>>>>>>>> Pass global MCA parameters that are
>>>>>>>> applicable to
>>>>>>>> all contexts (arg0 is the parameter
>>>>>>>> name; arg1 is
>>>>>>>> the parameter value)
>>>>>>>> -h|--help This help message
>>>>>>>> --hnp-jobid <arg0> This should be the jobid of the HNP
>>>>>>>> whose
>>>>>>>> applications you wish to checkpoint.
>>>>>>>> --hnp-pid <arg0> This should be the pid of the
>>>>>>>> mpirun whose
>>>>>>>> applications you wish to checkpoint.
>>>>>>>> -mca|--mca <arg0> <arg1>
>>>>>>>> Pass context-specific MCA
>>>>>>>> parameters; they are
>>>>>>>> considered global if --gmca is not
>>>>>>>> used and only
>>>>>>>> one context is specified (arg0 is
>>>>>>>> the parameter
>>>>>>>> name; arg1 is the parameter value)
>>>>>>>> -s|--status Display status messages describing
>>>>>>>> the progression
>>>>>>>> of the checkpoint
>>>>>>>> --term Terminate the application after
>>>>>>>> checkpoint
>>>>>>>> -v|--verbose Be Verbose
>>>>>>>> -w|--nowait Do not wait for the application to
>>>>>>>> finish
>>>>>>>> checkpointing before returning
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------
>>>>>>>> -----------
>>>>>>>> [sdiaz_at_compute-3-17 ~]$ exit
>>>>>>>> logout
>>>>>>>> Connection to c3-17 closed.
>>>>>>>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>>>>>>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>>>>>>> -bash-3.00$ ps auxf |grep sdiaz
>>>>>>>>
>>>>>>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
>>>>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /
>>>>>>>> opt/cesga/sge62/default/spool/compute-3-18/active_jobs/
>>>>>>>> 2645150.1/1.compute-3-18
>>>>>>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>>>>>>>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid
>>>>>>>> 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --
>>>>>>>> hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
>>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>>> mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/
>>>>>>>> openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -
>>>>>>>> mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/
>>>>>>>> mpi_test
>>>>>>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>>>>>>>> 0:00 \_ pi3
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>
>
> --
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sdiaz_at_[hidden] ; http://www.cesga.es/
> ------------------------------------------------