Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Reuti (reuti_at_[hidden])
Date: 2009-12-14 15:25:00


Hi,

No, I have never tried Open MPI's checkpointing. But there are two HOWTOs
from which you may get some ideas on how to integrate it with SGE:

http://gridengine.sunsource.net/howto/checkpointing.html
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf

(Open MPI's checkpointing seems to be more like Condor's, though, as you
don't have to deal with any process list on your own, AFAIK.)

Included is also an example to integrate SGE with the Condor
checkpointing library in standalone mode.

One purpose of the checkpointing interface can be to copy the files from
a local (checkpointing) directory on a node to a shared space like
/home/checkpoint (the $SGE_CKPT_DIR; in the examples I even created a
subdirectory named after the $JOB_ID therein). Later on, when the job
restarts, the files can be copied back to the (maybe different) nodes,
either in a queue prolog or in the job script.
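
The copy-out and copy-back steps could be sketched roughly as below. This
is only an illustration, not something taken from the HOWTOs: the function
names and the local directory argument are hypothetical, and it assumes
that SGE has set $SGE_CKPT_DIR and $JOB_ID in the environment of whatever
script calls these functions.

```shell
#!/bin/sh
# Sketch only. ckpt_stage_out / ckpt_stage_in and their argument are
# hypothetical names. SGE_CKPT_DIR and JOB_ID are assumed to be set by
# SGE for the job.

# Copy the snapshot files from a node-local directory to shared space,
# into a per-job subdirectory $SGE_CKPT_DIR/$JOB_ID.
ckpt_stage_out() {
    local_dir=$1
    shared_dir="$SGE_CKPT_DIR/$JOB_ID"
    mkdir -p "$shared_dir" || return 1
    # "dir/." copies the contents of dir, including hidden files.
    cp -Rp "$local_dir"/. "$shared_dir"/
}

# Copy the files back to a (possibly different) node before the restart.
ckpt_stage_in() {
    local_dir=$1
    shared_dir="$SGE_CKPT_DIR/$JOB_ID"
    # Nothing to restore if no checkpoint was ever staged out.
    [ -d "$shared_dir" ] || return 1
    mkdir -p "$local_dir" || return 1
    cp -Rp "$shared_dir"/. "$local_dir"/
}
```

In such a setup, ckpt_stage_out would be called after a checkpoint has
been written, and ckpt_stage_in from the queue prolog or at the top of the
job script when the job is restarted.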

-- Reuti

On 14.12.2009 at 18:25, Sergio Díaz wrote:

> Hi Reuti,
>
> Yes, I submitted a job through SGE and checkpointed the mpirun process
> by hand, after logging in to the MPI master node. Then I killed the job
> with qdel, and after that I ran ompi-restart.
> I will try to integrate it with SGE by creating a ckpt environment, but
> I think it could be a bit difficult because:
> 1 - When I checkpoint, I can't specify a directory with a name like
> checkpoint_jobid.
> 2 - I can't specify the scratch directory, so I have to use /tmp
> instead of SGE's scratch directory.
> 3 - I tried to restart the snapshot, and it only works if I use the
> same machinefile. That is, if the job ran on c3-13 and c3-14, I have
> to restart the job using a machinefile with these two nodes.
>
> [sdiaz_at_svgd ~]$ ompi-restart -v -machinefile
> mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
> [svgd.cesga.es:28836] Checking for the
> existence of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
> [svgd.cesga.es:28836] Restarting from file
> (ompi_global_snapshot_12554.ckpt)
> [svgd.cesga.es:28836] Exec in self
> tiempo 110
> Process 1 :
> compute-3-14.local
> of 2
> tiempo 110
> Process 0 :
> compute-3-13.local
> of 2
>
> ----------------------------------------------------------------------
> ----
> mpirun noticed that process rank 1 with PID
> 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
>
> ----------------------------------------------------------------------
> ----
>
> For problem 1, there is a feature ticket opened by Josh:
> https://svn.open-mpi.org/trac/ompi/ticket/2098
> For problem 2, there is a thread discussing it ([OMPI users] Changing
> location where checkpoints are saved) and also a bug opened by Josh:
> https://svn.open-mpi.org/trac/ompi/ticket/2139
> I think that it could work... we will see.
> For problem 3, I haven't had time to look into it. But if Josh or
> anyone else has an idea... please tell us :-)
>
> Reuti, did you test it successfully? How do you solve these problems?
>
> Regards,
> Sergio
>
>
> Reuti wrote:
>>
>> Hi,
>>
>> On 14.12.2009 at 17:05, Sergio Díaz wrote:
>>
>>> I got a successful checkpoint with a fresh installation and
>>> without using the trunk. I can't understand why it is working now
>>> when before I could not do a successful restart... Maybe there was
>>> something wrong in the Open MPI installation and so the metadata
>>> was created incorrectly.
>>> I will test it more, and I will also test the trunk.
>>>
>>> Regards,
>>> Sergio
>>>
>>> [sdiaz_at_compute-3-13 ~]$ ompi-restart -machinefile mpi_test/
>>> lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>>> tiempo 110
>>> Process 1 :
>>> compute-3-14.local
>>> of 2
>>> tiempo 110
>>> Process 0 :
>>> compute-3-13.local
>>> of 2
>>> tiempo 120
>>> Process 1 :
>>> compute-3-14.local
>>> of 2
>>> tiempo 120
>>> Process 0 :
>>> compute-3-13.local
>>> ...
>>> ...
>>>
>>> [sdiaz_at_compute-3-14 ~]$ ps auxf |grep sdiaz
>>> sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00
>>> orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca
>>
>> In a Tight Integration with SGE, the daemon should get the argument
>> --no-daemonize. Are you restarting, on the command line, a job
>> that previously ran under SGE's supervision?
>>
>> -- Reuti
>>
>>
>>> orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>>> 1739128832.0;tcp://192.168.4.148:45551 -mca
>>> mca_base_param_file_prefix ft-enable-cr -mca
>>> mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/
>>> openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca
>>> mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
>>> sdiaz 26274 0.1 0.0 15984 504 ? Sl 15:58 0:00
>>> \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/
>>> opal_snapshot_1.ckpt/ompi_blcr_context.26047
>>> sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58
>>> 0:00 \_ ./pi3
>>>
>>> [sdiaz_at_compute-3-13 ~]$ ps auxf |grep sdiaz
>>> root 12878 0.0 0.0 90260 3000 pts/0 S 15:55 0:00
>>> | \_ su - sdiaz
>>> sdiaz 12880 0.0 0.0 53432 1512 pts/0 S 15:55 0:00
>>> | \_ -bash
>>> sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00
>>> | \_ mpirun -am ft-enable-cr --default-hostfile
>>> mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/
>>> ompi_global_snapshot_12554.ckpt/restart-appfile
>>> sdiaz 13073 0.0 0.0 15988 508 pts/0 Sl+ 15:58 0:00
>>> | \_ cr_restart /home/cesga/sdiaz/
>>> ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/
>>> ompi_blcr_context.12558
>>> sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00
>>> | \_ ./pi3
>>>
>>>
>>> Sergio Díaz wrote:
>>>>
>>>> Hi Josh
>>>>
>>>> Here is the file.
>>>>
>>>> I will try the trunk, but I think that I broke my Open MPI
>>>> installation by doing "something", and I don't know what :-(
>>>> I was modifying the MCA parameters...
>>>> When I submit a job, the orted daemon spawned on the slave host
>>>> is launched in a loop until all the reserved memory is used up.
>>>> It is very strange, so I will compile it again, reproduce
>>>> the bug, and then test the trunk.
>>>>
>>>> Thanks a lot for the support and tickets opened.
>>>> Sergio
>>>>
>>>>
>>>> sdiaz 30279 0.0 0.0 1888 560 ? Ds 12:54
>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/
>>>> cesga/sge62/default/spool/compute
>>>> sdiaz 30286 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -
>>>> mca ess env -mca orte_ess_jobid 219
>>>> sdiaz 30322 0.0 0.0 52772 1188 ? S 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>>>> sdiaz 30358 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/
>>>> orted
>>>> sdiaz 30394 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/
>>>> bin/orted
>>>> sdiaz 30430 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30466 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30502 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30538 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/
>>>> cesga/openmpi-1.3.3/bin/orted
>>>> sdiaz 30574 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/
>>>> cesga/openmpi-1.3.3/bin/orted
>>>> ....
>>>>
>>>>
>>>>
>>>> Josh Hursey wrote:
>>>>>
>>>>>
>>>>> On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
>>>>>
>>>>>> Hi Josh,
>>>>>>
>>>>>> You were right. The main problem was /tmp. SGE uses a
>>>>>> scratch directory in which jobs keep their temporary files.
>>>>>> With TMPDIR set to /tmp, checkpointing works!
>>>>>> However, when I try to restart, I get the following error
>>>>>> (see ERROR 1). The -v option adds these lines (see ERROR 2).
>>>>>
>>>>> It is concerning that ompi-restart is segfault'ing when it
>>>>> errors out. The error message is being generated between the
>>>>> launch of the opal-restart starter command and when we try to
>>>>> exec(cr_restart). Usually the failure is related to a
>>>>> corruption of the metadata stored in the checkpoint.
>>>>>
>>>>> Can you send me the file below:
>>>>> ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/
>>>>> snapshot_meta.data
>>>>>
>>>>> I was able to reproduce the segv (at least I think it is the
>>>>> same one). We failed to check the validity of a string when we
>>>>> parse the metadata. I committed a fix to the trunk in r22290,
>>>>> and requested that the fix be moved to the v1.4 and v1.5
>>>>> branches. If you are interested in seeing when they get applied
>>>>> you can follow the following tickets:
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2140
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2141
>>>>>
>>>>> Can you try the trunk to see if the problem goes away? The
>>>>> development trunk and v1.5 series have a bunch of improvements
>>>>> to the C/R functionality that were never brought over to the
>>>>> v1.3/v1.4 series.
>>>>>
>>>>>>
>>>>>> I was trying to use ssh instead of rsh, but it was impossible.
>>>>>> By default it should use ssh and, if it finds a problem, fall
>>>>>> back to rsh. It seems that ssh doesn't work, because it always
>>>>>> uses rsh.
>>>>>> If I change this MCA parameter, it still uses rsh.
>>>>>> If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it
>>>>>> tries to use ssh and doesn't work. I get "bash: orted: command
>>>>>> not found" and the MPI process dies.
>>>>>> The command it tries to execute is shown below (see ERROR4). I
>>>>>> haven't yet found the reason why this command can't find
>>>>>> orted, because I set up /etc/bashrc so that the right path is
>>>>>> always available, and I have the right path in my application.
>>>>>
>>>>> This seems like an SGE specific issue, so a bit out of my
>>>>> domain. Maybe others have suggestions here.
>>>>>
>>>>> -- Josh
>>>>>>
>>>>>>
>>>>>> Many thanks!,
>>>>>> Sergio
>>>>>>
>>>>>> P.S. Sorry about these long emails. I just try to show you
>>>>>> useful information to identify my problems.
>>>>>>
>>>>>>
>>>>>> ERROR 1
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>>
>>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart
>>>>>> ompi_global_snapshot_28454.ckpt
>>>>>> >
>>>>>> -----------------------------------------------------------------
>>>>>> ---------
>>>>>> > Error: Unable to obtain the proper restart command to
>>>>>> restart from the
>>>>>> > checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>>>>>> >
>>>>>> >
>>>>>> -----------------------------------------------------------------
>>>>>> ---------
>>>>>> >
>>>>>> -----------------------------------------------------------------
>>>>>> ---------
>>>>>> > Error: Unable to obtain the proper restart command to
>>>>>> restart from the
>>>>>> > checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>>>>> >
>>>>>> >
>>>>>> -----------------------------------------------------------------
>>>>>> ---------
>>>>>> > [compute-3-18:28792] *** Process received signal ***
>>>>>> > [compute-3-18:28792] Signal: Segmentation fault (11)
>>>>>> > [compute-3-18:28792] Signal code: (128)
>>>>>> > [compute-3-18:28792] Failing at address: (nil)
>>>>>> > [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0
>>>>>> [0x33bbf0c430]
>>>>>> > [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free
>>>>>> +0x25) [0x33bb669135]
>>>>>> > [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
>>>>>> > [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
>>>>>> > [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
>>>>>> > [compute-3-18:28792] [ 5] opal-restart [0x40312a]
>>>>>> > [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6
>>>>>> (__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>>> > [compute-3-18:28792] [ 7] opal-restart [0x40272a]
>>>>>> > [compute-3-18:28792] *** End of error message ***
>>>>>> > [compute-3-18:28793] *** Process received signal ***
>>>>>> > [compute-3-18:28793] Signal: Segmentation fault (11)
>>>>>> > [compute-3-18:28793] Signal code: (128)
>>>>>> > [compute-3-18:28793] Failing at address: (nil)
>>>>>> > [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0
>>>>>> [0x33bbf0c430]
>>>>>> > [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free
>>>>>> +0x25) [0x33bb669135]
>>>>>> > [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
>>>>>> > [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
>>>>>> > [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
>>>>>> > [compute-3-18:28793] [ 5] opal-restart [0x40312a]
>>>>>> > [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6
>>>>>> (__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>>> > [compute-3-18:28793] [ 7] opal-restart [0x40272a]
>>>>>> > [compute-3-18:28793] *** End of error message ***
>>>>>> >
>>>>>> -----------------------------------------------------------------
>>>>>> ---------
>>>>>> > mpirun noticed that process rank 0 with PID 28792 on node
>>>>>> compute-3-18.local exited on signal 11 (Segmentation fault).
>>>>>> >
>>>>>> -----------------------------------------------------------------
>>>>>> ---------
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> ERROR 2
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>
>>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart -v
>>>>>> ompi_global_snapshot_28454.ckpt
>>>>>> >[compute-3-18.local:28941] Checking for the existence of (/
>>>>>> home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
>>>>>> > [compute-3-18.local:28941] Restarting from file
>>>>>> (ompi_global_snapshot_28454.ckpt)
>>>>>> > [compute-3-18.local:28941] Exec in self
>>>>>> > .......
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> ERROR3
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>
>>>>>> >[sdiaz_at_compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
>>>>>> > How many plm_rsh_agent instances to invoke
>>>>>> concurrently (must be > 0)
>>>>>> > MCA plm: parameter "plm_rsh_agent" (current value:
>>>>>> "ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>
>>>>>>
>>>>>> ERROR4
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>
>>>>>> >/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -
>>>>>> mca ess env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid
>>>>>> 1 -mca orte_ess_num_procs 2 --hnp-uri >"2152464384.0;tcp://
>>>>>> 192.168.4.143:59176" -mca mca_base_param_file_prefix ft-enable-
>>>>>> cr -mca mca_base_param_file_path >/opt/cesga/openmpi-1.3.3/
>>>>>> share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/
>>>>>> mpi_test -mca mca_base_param_file_path_force /home_no_usc/
>>>>>> cesga/sdiaz/mpi_test
>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> >>>>>>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Josh Hursey wrote:
>>>>>>>
>>>>>>> On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Josh,
>>>>>>>>
>>>>>>>> The OpenMPI version is 1.3.3.
>>>>>>>>
>>>>>>>> The command ompi-ps doesn't work.
>>>>>>>>
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
>>>>>>>> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs
>>>>>>>> and setting contact info into RML...
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
>>>>>>>> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs
>>>>>>>> and setting contact info into RML...
>>>>>>>>
>>>>>>>> [root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
>>>>>>>> root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
>>>>>>>> 0:00 \_ grep sdiaz
>>>>>>>> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
>>>>>>>> 0:00 \_ -bash /opt/cesga/sge62/default/spool/
>>>>>>>> compute-3-18/job_scripts/2726959
>>>>>>>> sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
>>>>>>>> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
>>>>>>>> inherit -nostdin -V compute-3-17.local orted -mca ess env -
>>>>>>>> mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
>>>>>>>> orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://
>>>>>>>> 192.168.4.143:57010" -mca mca_base_param_file_prefix ft-
>>>>>>>> enable-cr -mca mca_base_param_file_path /opt/cesga/
>>>>>>>> openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/
>>>>>>>> cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /
>>>>>>>> home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
>>>>>>>> 0:00 \_ ./pi3
>>>>>>>>
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n c3-18
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n
>>>>>>>>
>>>>>>>> There is no directory in /tmp on the node. However, if
>>>>>>>> the application is run without SGE, the directory is created.
>>>>>>>>
>>>>>>> This may be the core of the problem. ompi-ps and other
>>>>>>> command line tools (e.g., ompi-checkpoint) look for the Open
>>>>>>> MPI session directory in /tmp in order to find the connection
>>>>>>> information to connect to the mpirun process (internally
>>>>>>> called the HNP or Head Node Process).
>>>>>>>
>>>>>>> Can you change the location of the temporary directory in
>>>>>>> SGE? The temporary directory is usually set via an
>>>>>>> environment variable (e.g., TMPDIR, or TMP). So removing the
>>>>>>> environment variable or setting it to /tmp might help.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> But if I run ompi-ps -j MPIRUN_PID, it seems to hang and I
>>>>>>>> interrupt it. Does it take a long time?
>>>>>>>>
>>>>>>> It should not take a long time. It is just querying the
>>>>>>> mpirun process for state information.
>>>>>>>
>>>>>>>
>>>>>>>> What does the -j option of the ompi-ps command mean? It isn't
>>>>>>>> related to a batch system (like SGE, Condor...), is it?
>>>>>>>>
>>>>>>> The '-j' option allows the user to specify the Open MPI
>>>>>>> jobid. This is completely different than the jobid provided
>>>>>>> by the batch system. In general, users should not need to
>>>>>>> specify the -j option. It is useful when you have multiple
>>>>>>> Open MPI jobs, and want a summary of just one of them.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks for the ticket. I will follow it.
>>>>>>>>
>>>>>>>> Talking with Alan, I realized that only a few transport
>>>>>>>> protocols are supported, and maybe that is the problem.
>>>>>>>> Currently, SGE is using qrsh to launch the MPI processes. I can
>>>>>>>> change this and use ssh instead. So I'm going to test it
>>>>>>>> this afternoon and I will let you know the results.
>>>>>>>>
>>>>>>> Try 'ssh' and see if that helps. I suspect the problem is
>>>>>>> with the session directory location though.
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sergio
>>>>>>>>
>>>>>>>>
>>>>>>>> Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have managed to checkpoint a simple program without
>>>>>>>>>> SGE. Now I'm trying to integrate Open MPI with SGE, but
>>>>>>>>>> I have some problems... When I try to checkpoint the
>>>>>>>>>> mpirun PID, I get an error similar to the one I get
>>>>>>>>>> when the PID doesn't exist. See the example below.
>>>>>>>>>>
>>>>>>>>> I do not have any experience with the SGE environment, so I
>>>>>>>>> suspect that there may be something 'special' about the
>>>>>>>>> environment that is tripping up the ompi-checkpoint tool.
>>>>>>>>>
>>>>>>>>> First of all, what version of Open MPI are you using?
>>>>>>>>>
>>>>>>>>> Some things to check:
>>>>>>>>> - Does 'ompi-ps' work when your application is running?
>>>>>>>>> - Is there a /tmp/openmpi-sessions-* directory on the
>>>>>>>>> node where mpirun is currently running? This directory
>>>>>>>>> contains information on how to connect to the mpirun
>>>>>>>>> process from an external tool. If it's missing, then this
>>>>>>>>> could be the cause of the problem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Any ideas?
>>>>>>>>>> Does anybody have a script to do it automatically with SGE? For
>>>>>>>>>> example, I have one that checkpoints every X seconds with
>>>>>>>>>> BLCR for non-MPI jobs. It is launched by SGE if you have
>>>>>>>>>> configured the queue and the ckpt environment.
>>>>>>>>>>
>>>>>>>>> I do not know of any integration of the Open MPI
>>>>>>>>> checkpointing work with SGE at the moment.
>>>>>>>>>
>>>>>>>>> As far as time triggered checkpointing, I have a feature
>>>>>>>>> ticket open about this:
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1961
>>>>>>>>>
>>>>>>>>> It is not available yet, but in the works.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Is it possible to choose the name of the ckpt folder when you
>>>>>>>>>> run ompi-checkpoint? I can't find an option to do it.
>>>>>>>>>>
>>>>>>>>> Not at this time. Though I could see it as a useful
>>>>>>>>> feature, and shouldn't be too hard to implement. I filed a
>>>>>>>>> ticket if you want to follow the progress:
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2098
>>>>>>>>>
>>>>>>>>> -- Josh
>>>>>>>>>> Regards,
>>>>>>>>>> Sergio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>>>>>>>>>> ....
>>>>>>>>>> root 20044 0.0 0.0 4468 1224 ? S 13:28
>>>>>>>>>> 0:00 \_ sge_shepherd-2645150 -bg
>>>>>>>>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
>>>>>>>>>> 0:00 \_ -bash /opt/cesga/sge62/default/spool/
>>>>>>>>>> compute-3-17/job_scripts/2645150
>>>>>>>>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>>>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>>>>>>>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>>>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
>>>>>>>>>> inherit -nostdin -V compute-3-18..........
>>>>>>>>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>>>>>>>>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>>>>>>>>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>>>>>>>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>>>>>>>>>> -------------------------------------------------------------
>>>>>>>>>> -------------
>>>>>>>>>> ompi-checkpoint PID_OF_MPIRUN
>>>>>>>>>> Open MPI Checkpoint Tool
>>>>>>>>>>
>>>>>>>>>> -am <arg0> Aggregate MCA parameter set file
>>>>>>>>>> list
>>>>>>>>>> -gmca|--gmca <arg0> <arg1>
>>>>>>>>>> Pass global MCA parameters that
>>>>>>>>>> are applicable to
>>>>>>>>>> all contexts (arg0 is the
>>>>>>>>>> parameter name; arg1 is
>>>>>>>>>> the parameter value)
>>>>>>>>>> -h|--help This help message
>>>>>>>>>> --hnp-jobid <arg0> This should be the jobid of the
>>>>>>>>>> HNP whose
>>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>>> --hnp-pid <arg0> This should be the pid of the
>>>>>>>>>> mpirun whose
>>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>>> -mca|--mca <arg0> <arg1>
>>>>>>>>>> Pass context-specific MCA
>>>>>>>>>> parameters; they are
>>>>>>>>>> considered global if --gmca is
>>>>>>>>>> not used and only
>>>>>>>>>> one context is specified (arg0 is
>>>>>>>>>> the parameter
>>>>>>>>>> name; arg1 is the parameter value)
>>>>>>>>>> -s|--status Display status messages
>>>>>>>>>> describing the progression
>>>>>>>>>> of the checkpoint
>>>>>>>>>> --term Terminate the application after
>>>>>>>>>> checkpoint
>>>>>>>>>> -v|--verbose Be Verbose
>>>>>>>>>> -w|--nowait Do not wait for the application
>>>>>>>>>> to finish
>>>>>>>>>> checkpointing before returning
>>>>>>>>>>
>>>>>>>>>> -------------------------------------------------------------
>>>>>>>>>> -------------
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ exit
>>>>>>>>>> logout
>>>>>>>>>> Connection to c3-17 closed.
>>>>>>>>>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>>>>>>>>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>>>>>>>>> -bash-3.00$ ps auxf |grep sdiaz
>>>>>>>>>>
>>>>>>>>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
>>>>>>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/
>>>>>>>>>> qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/
>>>>>>>>>> active_jobs/2645150.1/1.compute-3-18
>>>>>>>>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>>>>>>>>>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid
>>>>>>>>>> 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>>>>>>>> --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
>>>>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>>>>> mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/
>>>>>>>>>> openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -
>>>>>>>>>> mca mca_base_param_file_path_force /home_no_usc/cesga/
>>>>>>>>>> sdiaz/mpi_test
>>>>>>>>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sergio Díaz Montes
>>>>>>>>>> Centro de Supercomputacion de Galicia
>>>>>>>>>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de
>>>>>>>>>> Compostela (Spain)
>>>>>>>>>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
>>>>>>>>>> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>