
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Sergio Díaz (sdiaz_at_[hidden])
Date: 2009-12-15 03:56:48


Hi,

Thanks Reuti. These links were very useful when I did the integration of
BLCR with SGE. I will review them to check if there is more useful
information.

Regards,
Sergio

Reuti wrote:
> Hi,
>
> no, I never tried Open MPI's checkpointing. But there are two Howto's
> from which you may get some ideas to integrate it with SGE:
>
> http://gridengine.sunsource.net/howto/checkpointing.html
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf (but Open
> MPI's checkpointing seems to be more like Condor's, as you don't have
> to deal with any process list on your own, AFAIK)
>
> Also included is an example of integrating SGE with the Condor
> checkpointing library in standalone mode.
>
> The purpose of the checkpointing interface can be to copy the files from a
> local (checkpointing) directory on a node to a shared space like
> /home/checkpoint (the $SGE_CKPT_DIR [I even created a subdirectory
> with the $JOB_ID therein in the examples]). Later on, the files can be
> copied to the (maybe different) nodes again (either in a queue prolog
> or the job script) when the job restarts.
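The copy step described above can be sketched as a pair of small shell helpers. This is only a sketch under stated assumptions: the helper names and argument layout are illustrative, not SGE-provided; only $SGE_CKPT_DIR and $JOB_ID come from the setup Reuti describes.

```shell
#!/bin/sh
# Sketch of the checkpoint "migration" idea: save_checkpoint copies
# checkpoint files from a node-local directory into a shared per-job
# directory (e.g. $SGE_CKPT_DIR/$JOB_ID); restore_checkpoint copies them
# back onto a (possibly different) node before the job restarts.

save_checkpoint() {
    local_dir=$1                 # node-local checkpoint directory
    shared_dir=$2/$3             # e.g. $SGE_CKPT_DIR/$JOB_ID
    mkdir -p "$shared_dir"
    cp -p "$local_dir"/* "$shared_dir"/
}

restore_checkpoint() {
    shared_dir=$1/$2             # e.g. $SGE_CKPT_DIR/$JOB_ID
    local_dir=$3                 # node-local checkpoint directory
    mkdir -p "$local_dir"
    cp -p "$shared_dir"/* "$local_dir"/
}
```

In a real setup, save_checkpoint would run in the checkpoint command of the SGE ckpt environment, and restore_checkpoint in a queue prolog or at the top of the job script.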
>
>
> -- Reuti
>
>
> On 14.12.2009, at 18:25, Sergio Díaz wrote:
>
>> Hi Reuti,
>>
>> Yes, I sent a job with SGE and I checkpointed the mpirun process by
>> hand, entering the MPI master node. Then I killed the job with
>> qdel and after that I did the ompi-restart.
>> I will try to integrate it with SGE by creating a ckpt environment, but
>> I think that it could be a bit difficult because:
>> 1 - when I do a checkpoint, I can't specify a directory with a
>> name like checkpoint_jobid;
>> 2 - I can't specify the scratch directory, and I have to use
>> /tmp instead of SGE's scratch directory;
>> 3 - I tried to restart the snapshot and it only works if I
>> use the same machinefile. That is, if the job ran on c3-13 and
>> c3-14, I have to restart the job using a machinefile with these two
>> nodes.
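Put together, the by-hand cycle described here is roughly the following. This is a sketch of a command sequence, not a runnable script outside a live job; the PID, job id, machinefile, and snapshot name are the example values from this thread.

```shell
# On the MPI master node of the running SGE job:
ompi-checkpoint -s 12554          # 12554 = PID of mpirun (example value)

# Kill the SGE job:
qdel 3117822                      # example SGE job id

# Later, restart from the snapshot; the machinefile must list the same
# nodes the job originally ran on (problem 3 above):
ompi-restart -v -machinefile mpi_test/lanzar_pi3.sh.po3117822 \
    ompi_global_snapshot_12554.ckpt
```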
>>
>> [sdiaz_at_svgd ~]$ ompi-restart -v -machinefile
>> mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>> [svgd.cesga.es:28836] Checking for the existence
>> of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
>> [svgd.cesga.es:28836] Restarting from file
>> (ompi_global_snapshot_12554.ckpt)
>> [svgd.cesga.es:28836] Exec in self
>> tiempo 110
>> Process 1 :
>> compute-3-14.local
>> of 2
>> tiempo 110
>> Process 0 :
>> compute-3-13.local
>> of 2
>>
>> --------------------------------------------------------------------------
>>
>> mpirun noticed that process rank 1 with PID
>> 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
>>
>> --------------------------------------------------------------------------
>>
>>
>> To solve problem 1, there is a feature request opened by Josh
>> (https://svn.open-mpi.org/trac/ompi/ticket/2098).
>> To solve problem 2, there is a thread that discusses it ([OMPI
>> users] Changing location where checkpoints are saved) and also a bug
>> opened by Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 . I
>> think that it could work... we will see.
>> As for problem 3, I didn't have time to look into it. But if Josh or
>> anyone has an idea... please tell us :-)
>>
>> Reuti, did you test it successfully? How did you solve these problems?
>>
>> Regards,
>> Sergio
>>
>>
>>> Reuti wrote:
>>>
>>> Hi,
>>>
>>> On 14.12.2009, at 17:05, Sergio Díaz wrote:
>>>
>>>> I got a successful checkpoint with a fresh installation and without
>>>> using the trunk. I can't understand why it is working now when before
>>>> I couldn't do a successful restart... Maybe there was something wrong
>>>> in the Open MPI installation and then the metadata was created in a
>>>> wrong way.
>>>> I will test it more and I will also test the trunk.
>>>>
>>>> Regards,
>>>> Sergio
>>>>
>>>> [sdiaz_at_compute-3-13 ~]$ ompi-restart -machinefile
>>>> mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>>>> tiempo 110
>>>> Process 1 :
>>>> compute-3-14.local
>>>> of 2
>>>> tiempo 110
>>>> Process 0 :
>>>> compute-3-13.local
>>>> of 2
>>>> tiempo 120
>>>> Process 1 :
>>>> compute-3-14.local
>>>> of 2
>>>> tiempo 120
>>>> Process 0 :
>>>> compute-3-13.local
>>>> ...
>>>> ...
>>>>
>>>> [sdiaz_at_compute-3-14 ~]$ ps auxf |grep sdiaz
>>>> sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00
>>>> orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca
>>>
>>> in a Tight Integration with SGE the daemon should get the argument
>>> --no-daemonize. Are you restarting a job on the command line which
>>> previously ran under SGE's supervision?
>>>
>>> -- Reuti
>>>
>>>> orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>>>> 1739128832.0;tcp://192.168.4.148:45551 -mca
>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>> mca_base_param_file_path
>>>> /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz
>>>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
>>>> sdiaz 26274 0.1 0.0 15984 504 ? Sl 15:58 0:00 \_
>>>> cr_restart
>>>> /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
>>>>
>>>> sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00
>>>> \_ ./pi3
>>>>
>>>> [sdiaz_at_compute-3-13 ~]$ ps auxf |grep sdiaz
>>>> root 12878 0.0 0.0 90260 3000 pts/0 S 15:55 0:00
>>>> | \_ su - sdiaz
>>>> sdiaz 12880 0.0 0.0 53432 1512 pts/0 S 15:55 0:00
>>>> | \_ -bash
>>>> sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00
>>>> | \_ mpirun -am ft-enable-cr --default-hostfile
>>>> mpi_test/lanzar_pi3.sh.po3117822 --app
>>>> /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/restart-appfile
>>>> sdiaz 13073 0.0 0.0 15988 508 pts/0 Sl+ 15:58 0:00
>>>> | \_ cr_restart
>>>> /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.12558
>>>>
>>>> sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00
>>>> | \_ ./pi3
>>>>
>>>>
>>>> Sergio Díaz wrote:
>>>>>
>>>>> Hi Josh
>>>>>
>>>>> Here you go the file.
>>>>>
>>>>> I will try to apply the trunk, but I think that I broke my
>>>>> Open MPI installation doing "something" and I don't know what :-( .
>>>>> I was modifying the MCA parameters...
>>>>> When I send a job, the orted daemon spawned on the SLAVE host is
>>>>> launched in a loop until it consumes all the reserved memory.
>>>>> It is very strange, so I will compile it again, reproduce
>>>>> the bug, and then test the trunk.
>>>>>
>>>>> Thanks a lot for the support and tickets opened.
>>>>> Sergio
>>>>>
>>>>>
>>>>> sdiaz 30279 0.0 0.0 1888 560 ? Ds 12:54
>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
>>>>> /opt/cesga/sge62/default/spool/compute
>>>>> sdiaz 30286 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca
>>>>> ess env -mca orte_ess_jobid 219
>>>>> sdiaz 30322 0.0 0.0 52772 1188 ? S 12:54
>>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30358 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30394 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash
>>>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30430 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash
>>>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30466 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash
>>>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30502 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash
>>>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30538 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash
>>>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> sdiaz 30574 0.0 0.0 52772 1188 ? D 12:54
>>>>> 0:00 \_ /bin/bash
>>>>> /opt/cesga/openmpi-1.3.3/bin/orted
>>>>> ....
>>>>>
>>>>>
>>>>>
>>>>> Josh Hursey wrote:
>>>>>>
>>>>>>
>>>>>> On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
>>>>>>
>>>>>>> Hi Josh,
>>>>>>>
>>>>>>> You were right. The main problem was /tmp. SGE uses a
>>>>>>> scratch directory in which the jobs have temporary files.
>>>>>>> Setting TMPDIR to /tmp, checkpointing works!
>>>>>>> However, when I try to restart it... I get the following error
>>>>>>> (see ERROR1). The -v option adds these lines (see ERROR2).
>>>>>>
>>>>>> It is concerning that ompi-restart is segfault'ing when it errors
>>>>>> out. The error message is being generated between the launch of
>>>>>> the opal-restart starter command and when we try to
>>>>>> exec(cr_restart). Usually the failure is related to a corruption
>>>>>> of the metadata stored in the checkpoint.
>>>>>>
>>>>>> Can you send me the file below:
>>>>>> ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
>>>>>>
>>>>>>
>>>>>> I was able to reproduce the segv (at least I think it is the same
>>>>>> one). We failed to check the validity of a string when we parse
>>>>>> the metadata. I committed a fix to the trunk in r22290, and
>>>>>> requested that the fix be moved to the v1.4 and v1.5 branches. If
>>>>>> you are interested in seeing when they get applied you can follow
>>>>>> the following tickets:
>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2140
>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2141
>>>>>>
>>>>>> Can you try the trunk to see if the problem goes away? The
>>>>>> development trunk and v1.5 series have a bunch of improvements to
>>>>>> the C/R functionality that were never brought over the v1.3/v1.4
>>>>>> series.
>>>>>>
>>>>>>>
>>>>>>> I was trying to use ssh instead of rsh but it was impossible. By
>>>>>>> default it should use ssh and, if it finds a problem, fall back
>>>>>>> to rsh. It seems that ssh doesn't work because it always uses rsh.
>>>>>>> If I change this MCA parameter, it still uses rsh.
>>>>>>> If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries
>>>>>>> to use ssh and doesn't work. I get --> "bash: orted: command not
>>>>>>> found" and the MPI process dies.
>>>>>>> The command it tries to execute is the following, and I haven't
>>>>>>> found the reason yet why it doesn't find orted,
>>>>>>> because I set /etc/bashrc in order to always get the right
>>>>>>> path, and I have the right path in my application. (see ERROR4).
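One standard Open MPI workaround for "orted: command not found" over ssh (non-interactive remote shells often don't source /etc/bashrc) is to let mpirun export its installation prefix to the remote nodes. This is a sketch using the install path from this thread; it is a suggestion, not something tested in this environment:

```shell
# --prefix makes mpirun set PATH and LD_LIBRARY_PATH for orted remotely:
mpirun --prefix /opt/cesga/openmpi-1.3.3 -np 2 -am ft-enable-cr ./pi3

# Invoking mpirun by its absolute path has the same effect as --prefix:
/opt/cesga/openmpi-1.3.3/bin/mpirun -np 2 -am ft-enable-cr ./pi3
```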
>>>>>>
>>>>>> This seems like an SGE specific issue, so a bit out of my domain.
>>>>>> Maybe others have suggestions here.
>>>>>>
>>>>>> -- Josh
>>>>>>>
>>>>>>>
>>>>>>> Many thanks!,
>>>>>>> Sergio
>>>>>>>
>>>>>>> P.S. Sorry about these long emails. I'm just trying to give you
>>>>>>> useful information to identify my problems.
>>>>>>>
>>>>>>>
>>>>>>> ERROR 1
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart
>>>>>>> ompi_global_snapshot_28454.ckpt
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> > Error: Unable to obtain the proper restart command to restart
>>>>>>> from the
>>>>>>> > checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>>>>>>> >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> > Error: Unable to obtain the proper restart command to restart
>>>>>>> from the
>>>>>>> > checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>>>>>> >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> > [compute-3-18:28792] *** Process received signal ***
>>>>>>> > [compute-3-18:28792] Signal: Segmentation fault (11)
>>>>>>> > [compute-3-18:28792] Signal code: (128)
>>>>>>> > [compute-3-18:28792] Failing at address: (nil)
>>>>>>> > [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0
>>>>>>> [0x33bbf0c430]
>>>>>>> > [compute-3-18:28792] [ 1]
>>>>>>> /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
>>>>>>> > [compute-3-18:28792] [ 2]
>>>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e)
>>>>>>> [0x2a95586658]
>>>>>>> > [compute-3-18:28792] [ 3]
>>>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e)
>>>>>>> [0x2a9557906e]
>>>>>>> > [compute-3-18:28792] [ 4]
>>>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36)
>>>>>>> [0x2a9556bcfa]
>>>>>>> > [compute-3-18:28792] [ 5] opal-restart [0x40312a]
>>>>>>> > [compute-3-18:28792] [ 6]
>>>>>>> /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>>>> > [compute-3-18:28792] [ 7] opal-restart [0x40272a]
>>>>>>> > [compute-3-18:28792] *** End of error message ***
>>>>>>> > [compute-3-18:28793] *** Process received signal ***
>>>>>>> > [compute-3-18:28793] Signal: Segmentation fault (11)
>>>>>>> > [compute-3-18:28793] Signal code: (128)
>>>>>>> > [compute-3-18:28793] Failing at address: (nil)
>>>>>>> > [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0
>>>>>>> [0x33bbf0c430]
>>>>>>> > [compute-3-18:28793] [ 1]
>>>>>>> /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
>>>>>>> > [compute-3-18:28793] [ 2]
>>>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e)
>>>>>>> [0x2a95586658]
>>>>>>> > [compute-3-18:28793] [ 3]
>>>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e)
>>>>>>> [0x2a9557906e]
>>>>>>> > [compute-3-18:28793] [ 4]
>>>>>>> /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36)
>>>>>>> [0x2a9556bcfa]
>>>>>>> > [compute-3-18:28793] [ 5] opal-restart [0x40312a]
>>>>>>> > [compute-3-18:28793] [ 6]
>>>>>>> /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>>>> > [compute-3-18:28793] [ 7] opal-restart [0x40272a]
>>>>>>> > [compute-3-18:28793] *** End of error message ***
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> > mpirun noticed that process rank 0 with PID 28792 on node
>>>>>>> compute-3-18.local exited on signal 11 (Segmentation fault).
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ERROR 2
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart -v
>>>>>>> ompi_global_snapshot_28454.ckpt
>>>>>>> >[compute-3-18.local:28941] Checking for the existence of
>>>>>>> (/home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
>>>>>>> > [compute-3-18.local:28941] Restarting from file
>>>>>>> (ompi_global_snapshot_28454.ckpt)
>>>>>>> > [compute-3-18.local:28941] Exec in self
>>>>>>> > .......
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ERROR3
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> >[sdiaz_at_compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
>>>>>>> > How many plm_rsh_agent instances to invoke
>>>>>>> concurrently (must be > 0)
>>>>>>> > MCA plm: parameter "plm_rsh_agent" (current value:
>>>>>>> "ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ERROR4
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> >/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -mca
>>>>>>> ess env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid 1 -mca
>>>>>>> orte_ess_num_procs 2 --hnp-uri
>>>>>>> >"2152464384.0;tcp://192.168.4.143:59176" -mca
>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>> mca_base_param_file_path
>>>>>>> >/opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>>> -mca mca_base_param_file_path_force
>>>>>>> /home_no_usc/cesga/sdiaz/mpi_test
>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Josh Hursey wrote:
>>>>>>>>
>>>>>>>> On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Josh,
>>>>>>>>>
>>>>>>>>> The OpenMPI version is 1.3.3.
>>>>>>>>>
>>>>>>>>> The command ompi-ps doesn't work.
>>>>>>>>>
>>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
>>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
>>>>>>>>> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and
>>>>>>>>> setting contact info into RML...
>>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
>>>>>>>>> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and
>>>>>>>>> setting contact info into RML...
>>>>>>>>>
>>>>>>>>> [root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
>>>>>>>>> root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
>>>>>>>>> 0:00 \_ grep sdiaz
>>>>>>>>> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
>>>>>>>>> 0:00 \_ -bash
>>>>>>>>> /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
>>>>>>>>> sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
>>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
>>>>>>>>> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
>>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh
>>>>>>>>> -inherit -nostdin -V compute-3-17.local orted -mca ess env
>>>>>>>>> -mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
>>>>>>>>> orte_ess_num_procs 2 --hnp-uri
>>>>>>>>> "2769879040.0;tcp://192.168.4.143:57010" -mca
>>>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>>>> mca_base_param_file_path
>>>>>>>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>>> -mca mca_base_param_file_path_force
>>>>>>>>> /home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>>> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
>>>>>>>>> 0:00 \_ ./pi3
>>>>>>>>>
>>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n c3-18
>>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
>>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n
>>>>>>>>>
>>>>>>>>> There is no such directory in /tmp on the node. However, if
>>>>>>>>> the application is run without SGE, the directory is created
>>>>>>>>>
>>>>>>>> This may be the core of the problem. ompi-ps and other command
>>>>>>>> line tools (e.g., ompi-checkpoint) look for the Open MPI
>>>>>>>> session directory in /tmp in order to find the connection
>>>>>>>> information to connect to the mpirun process (internally called
>>>>>>>> the HNP or Head Node Process).
>>>>>>>>
>>>>>>>> Can you change the location of the temporary directory in SGE?
>>>>>>>> The temporary directory is usually set via an environment
>>>>>>>> variable (e.g., TMPDIR, or TMP). So removing the environment
>>>>>>>> variable or setting it to /tmp might help.
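In job-script form this suggestion amounts to the following sketch; whether pointing TMPDIR back at /tmp is acceptable depends on the site's scratch policy, so treat it as an assumption to test:

```shell
# SGE points TMPDIR/TMP at a per-job scratch directory, but Open MPI's
# command-line tools look for the session directory under /tmp.
# Override before launching:
export TMPDIR=/tmp
export TMP=/tmp
mpirun -np 2 -am ft-enable-cr ./pi3
```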
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> but if I do ompi-ps -j MPIRUN_PID, it seems hung and I
>>>>>>>>> interrupt it. Does it take a long time?
>>>>>>>>>
>>>>>>>> It should not take a long time. It is just querying the mpirun
>>>>>>>> process for state information.
>>>>>>>>
>>>>>>>>
>>>>>>>>> What does the -j option of the ompi-ps command mean? It isn't
>>>>>>>>> related to a batch system (like SGE, Condor...), is it?
>>>>>>>>>
>>>>>>>> The '-j' option allows the user to specify the Open MPI jobid.
>>>>>>>> This is completely different than the jobid provided by the
>>>>>>>> batch system. In general, users should not need to specify the
>>>>>>>> -j option. It is useful when you have multiple Open MPI jobs,
>>>>>>>> and want a summary of just one of them.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks for the ticket. I will follow it.
>>>>>>>>>
>>>>>>>>> Talking with Alan, I realized that only a few transport
>>>>>>>>> protocols are supported. And maybe that is the problem.
>>>>>>>>> Currently, SGE is using qrsh to spawn the MPI processes. I can
>>>>>>>>> change this protocol and use ssh. So, I'm going to test it
>>>>>>>>> this afternoon and I will report the results to you.
>>>>>>>>>
>>>>>>>> Try 'ssh' and see if that helps. I suspect the problem is with
>>>>>>>> the session directory location though.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Sergio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Josh Hursey wrote:
>>>>>>>>>
>>>>>>>>>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I have achieved a checkpoint of a simple program without
>>>>>>>>>>> SGE. Now I'm trying to do the openmpi+sge integration, but I
>>>>>>>>>>> have some problems... When I try to checkpoint the
>>>>>>>>>>> mpirun PID, I get an error similar to the one I get when
>>>>>>>>>>> the PID doesn't exist. See the example below.
>>>>>>>>>>>
>>>>>>>>>> I do not have any experience with the SGE environment, so I
>>>>>>>>>> suspect that there may be something 'special' about the
>>>>>>>>>> environment that is tripping up the ompi-checkpoint tool.
>>>>>>>>>>
>>>>>>>>>> First of all, what version of Open MPI are you using?
>>>>>>>>>>
>>>>>>>>>> Some things to check:
>>>>>>>>>> - Does 'ompi-ps' work when your application is running?
>>>>>>>>>> - Is there a /tmp/openmpi-sessions-* directory on the node
>>>>>>>>>> where mpirun is currently running? This directory contains
>>>>>>>>>> information on how to connect to the mpirun process from an
>>>>>>>>>> external tool; if it's missing then this could be the cause
>>>>>>>>>> of the problem.
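The second check can be scripted along these lines. This is a sketch: the exact session-directory naming pattern varies across Open MPI versions, so the glob below is an assumption, and the function name is illustrative.

```shell
#!/bin/sh
# Look for an Open MPI session directory under /tmp for a given user.
# External tools (ompi-ps, ompi-checkpoint) need this directory to find
# mpirun's contact information.

check_session_dir() {
    user=$1
    found=0
    for d in /tmp/openmpi-sessions-"$user"*; do
        if [ -d "$d" ]; then
            echo "session directory: $d"
            found=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no session directory for $user under /tmp"
    fi
}

check_session_dir "$USER"
```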
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Any ideas?
>>>>>>>>>>> Does somebody have a script to do it automatically with SGE?
>>>>>>>>>>> For example, I have one to do a checkpoint every X seconds
>>>>>>>>>>> with BLCR for non-MPI jobs. It is launched by SGE if you have
>>>>>>>>>>> configured the queue and the ckpt environment.
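A minimal version of such a time-triggered script could look like the sketch below. cr_checkpoint is BLCR's real checkpoint tool, but the function name, the CKPT_CMD override (handy for dry runs), and the error handling are illustrative assumptions.

```shell
#!/bin/sh
# Checkpoint a running process every INTERVAL seconds with BLCR until it
# exits. CKPT_CMD can be overridden (e.g. CKPT_CMD=echo) for a dry run.

periodic_checkpoint() {
    pid=$1
    interval=$2
    : "${CKPT_CMD:=cr_checkpoint}"
    while kill -0 "$pid" 2>/dev/null; do
        $CKPT_CMD "$pid" || echo "checkpoint of $pid failed" >&2
        sleep "$interval"
    done
}
```

For MPI jobs, ompi-checkpoint on the mpirun PID would take the place of cr_checkpoint in such a loop.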
>>>>>>>>>>>
>>>>>>>>>> I do not know of any integration of the Open MPI
>>>>>>>>>> checkpointing work with SGE at the moment.
>>>>>>>>>>
>>>>>>>>>> As for time-triggered checkpointing, I have a feature
>>>>>>>>>> ticket open about it:
>>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1961
>>>>>>>>>>
>>>>>>>>>> It is not available yet, but in the works.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Is it possible to choose the name of the ckpt folder when you
>>>>>>>>>>> do the ompi-checkpoint? I can't find an option to do it.
>>>>>>>>>>>
>>>>>>>>>> Not at this time, though I could see it being a useful feature,
>>>>>>>>>> and it shouldn't be too hard to implement. I filed a ticket in
>>>>>>>>>> case you want to follow the progress:
>>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2098
>>>>>>>>>>
>>>>>>>>>> -- Josh
>>>>>>>>>>> Regards,
>>>>>>>>>>> Sergio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------
>>>>>>>>>>>
>>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>>>>>>>>>>> ....
>>>>>>>>>>> root 20044 0.0 0.0 4468 1224 ? S 13:28
>>>>>>>>>>> 0:00 \_ sge_shepherd-2645150 -bg
>>>>>>>>>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
>>>>>>>>>>> 0:00 \_ -bash
>>>>>>>>>>> /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
>>>>>>>>>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>>>>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>>>>>>>>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>>>>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh
>>>>>>>>>>> -inherit -nostdin -V compute-3-18..........
>>>>>>>>>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>>>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>>>>>>>>>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>>>>>>>>>
>>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>>>>>>>>>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>>>>>>>>>
>>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>>>>>>>>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>>>>>>>>>
>>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> ompi-checkpoint PID_OF_MPIRUN
>>>>>>>>>>> Open MPI Checkpoint Tool
>>>>>>>>>>>
>>>>>>>>>>> -am <arg0> Aggregate MCA parameter set file list
>>>>>>>>>>> -gmca|--gmca <arg0> <arg1>
>>>>>>>>>>> Pass global MCA parameters that are
>>>>>>>>>>> applicable to
>>>>>>>>>>> all contexts (arg0 is the parameter
>>>>>>>>>>> name; arg1 is
>>>>>>>>>>> the parameter value)
>>>>>>>>>>> -h|--help This help message
>>>>>>>>>>> --hnp-jobid <arg0> This should be the jobid of the HNP
>>>>>>>>>>> whose
>>>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>>>> --hnp-pid <arg0> This should be the pid of the
>>>>>>>>>>> mpirun whose
>>>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>>>> -mca|--mca <arg0> <arg1>
>>>>>>>>>>> Pass context-specific MCA
>>>>>>>>>>> parameters; they are
>>>>>>>>>>> considered global if --gmca is not
>>>>>>>>>>> used and only
>>>>>>>>>>> one context is specified (arg0 is
>>>>>>>>>>> the parameter
>>>>>>>>>>> name; arg1 is the parameter value)
>>>>>>>>>>> -s|--status Display status messages describing
>>>>>>>>>>> the progression
>>>>>>>>>>> of the checkpoint
>>>>>>>>>>> --term Terminate the application after
>>>>>>>>>>> checkpoint
>>>>>>>>>>> -v|--verbose Be Verbose
>>>>>>>>>>> -w|--nowait Do not wait for the application to
>>>>>>>>>>> finish
>>>>>>>>>>> checkpointing before returning
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ exit
>>>>>>>>>>> logout
>>>>>>>>>>> Connection to c3-17 closed.
>>>>>>>>>>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>>>>>>>>>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>>>>>>>>>> -bash-3.00$ ps auxf |grep sdiaz
>>>>>>>>>>>
>>>>>>>>>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
>>>>>>>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
>>>>>>>>>>> /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>>>>>>>>>>>
>>>>>>>>>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>>>>>>>>>>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid
>>>>>>>>>>> 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>>>>>>>>> --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
>>>>>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>>>>>> mca_base_param_file_path
>>>>>>>>>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>>>>> -mca mca_base_param_file_path_force
>>>>>>>>>>> /home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>>>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>>>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Sergio Díaz Montes
>>>>>>>>>>> Centro de Supercomputacion de Galicia
>>>>>>>>>>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela
>>>>>>>>>>> (Spain)
>>>>>>>>>>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
>>>>>>>>>>> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>

-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_[hidden] ; http://www.cesga.es/
------------------------------------------------


