Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-01-11 16:40:28


On Dec 14, 2009, at 12:25 PM, Sergio Díaz wrote:

> Hi Reuti,
>
> Yes, I submitted a job with SGE and checkpointed the mpirun process
> by hand, logging into the MPI master node. Then I killed the job with
> qdel, and after that I ran ompi-restart.
> I will try to integrate this with SGE by creating a checkpoint
> environment, but I think it could be a bit difficult because:
> 1 - When I take a checkpoint, I can't specify a directory
> with a name like checkpoint_jobid.
> 2 - I can't specify the scratch directory; I have to use
> /tmp instead of SGE's scratch directory.
> 3 - Restarting the snapshot only works if I use the same
> machinefile. That is, if the job ran on c3-13 and c3-14, I have to
> restart the job using a machinefile with those two nodes.

This is usually caused by prelinking interfering with BLCR. See the
BLCR FAQ for how to disable prelinking:
   https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

Let me know if that fixes this problem.
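On a typical RHEL-style system, disabling prelink comes down to two steps (a sketch only; the config file path and prelink flags here are assumptions that may differ by distribution, so check the BLCR FAQ entry above first):

```shell
# Sketch: stop future prelinking, then undo prelinking already applied.
# Assumes a RHEL-like /etc/sysconfig/prelink; run as root.
sed -i 's/^PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink
prelink -ua    # undo prelinking on all installed binaries/libraries
```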

Josh

>
> [sdiaz_at_svgd ~]$ ompi-restart -v -machinefile
> mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
> [svgd.cesga.es:28836] Checking for the existence
> of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
> [svgd.cesga.es:28836] Restarting from file
> (ompi_global_snapshot_12554.ckpt)
> [svgd.cesga.es:28836] Exec in self
> tiempo 110
> Process 1 :
> compute-3-14.local
> of 2
> tiempo 110
> Process 0 :
> compute-3-13.local
> of 2
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID
> 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
>
> --------------------------------------------------------------------------
>
> To solve problem 1, there is a feature ticket opened by Josh
> (https://svn.open-mpi.org/trac/ompi/ticket/2098).
> To solve problem 2, there is a thread that discusses it ([OMPI
> users] Changing location where checkpoints are saved) and also a bug
> ticket opened by Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 .
> I think that could work... we will see.
> To solve problem 3, I haven't had time to look into it yet. But if
> Josh or anyone has an idea... please tell us :-)
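An SGE checkpoint environment along these lines might serve as a starting point for that integration (a sketch only: the attribute names are standard SGE checkpoint-object fields, but every value, including the ompi-checkpoint/ompi-restart command lines, is a placeholder assumption rather than a tested configuration):

```shell
# Hypothetical SGE checkpoint object, created with: qconf -ackpt ompi_ckpt
# All values below are placeholders; $job_pid and $ckpt_dir are SGE
# pseudo-variables expanded by the scheduler.
ckpt_name          ompi_ckpt
interface          APPLICATION-LEVEL
ckpt_command       ompi-checkpoint --term $job_pid
migr_command       none
restart_command    ompi-restart ompi_global_snapshot_$job_pid.ckpt
clean_command      none
ckpt_dir           /tmp
signal             none
when               xsmr
```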
>
> Reuti, did you test it successfully? How did you solve these problems?
>
> Regards,
> Sergio
>
>
> Reuti wrote:
>>
>> Hi,
>>
>> On 14.12.2009 at 17:05, Sergio Díaz wrote:
>>
>>> I got a successful checkpoint with a fresh installation, without
>>> using the trunk. I can't understand why it works now when before I
>>> couldn't do a successful restart... Maybe there was something wrong
>>> in the Open MPI installation, and the metadata was created
>>> incorrectly.
>>> I will test it more, and I will also test the trunk.
>>>
>>> Regards,
>>> Sergio
>>>
>>> [sdiaz_at_compute-3-13 ~]$ ompi-restart -machinefile mpi_test/
>>> lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>>> tiempo 110
>>> Process 1 :
>>> compute-3-14.local
>>> of 2
>>> tiempo 110
>>> Process 0 :
>>> compute-3-13.local
>>> of 2
>>> tiempo 120
>>> Process 1 :
>>> compute-3-14.local
>>> of 2
>>> tiempo 120
>>> Process 0 :
>>> compute-3-13.local
>>> ...
>>> ...
>>>
>>> [sdiaz_at_compute-3-14 ~]$ ps auxf |grep sdiaz
>>> sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00
>>> orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca
>>
>> In a tight integration with SGE, the daemon should get the argument
>> --no-daemonize. Are you restarting a job on the command line that
>> previously ran under SGE's supervision?
>>
>> -- Reuti
>>
>>
>>> orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>>> 1739128832.0;tcp://192.168.4.148:45551 -mca
>>> mca_base_param_file_prefix ft-enable-cr -mca
>>> mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/
>>> openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca
>>> mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
>>> sdiaz 26274 0.1 0.0 15984 504 ? Sl 15:58 0:00 \_
>>> cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/
>>> opal_snapshot_1.ckpt/ompi_blcr_context.26047
>>> sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58
>>> 0:00 \_ ./pi3
>>>
>>> [sdiaz_at_compute-3-13 ~]$ ps auxf |grep sdiaz
>>> root 12878 0.0 0.0 90260 3000 pts/0 S 15:55 0:00
>>> | \_ su - sdiaz
>>> sdiaz 12880 0.0 0.0 53432 1512 pts/0 S 15:55 0:00
>>> | \_ -bash
>>> sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00
>>> | \_ mpirun -am ft-enable-cr --default-hostfile
>>> mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/
>>> ompi_global_snapshot_12554.ckpt/restart-appfile
>>> sdiaz 13073 0.0 0.0 15988 508 pts/0 Sl+ 15:58 0:00
>>> | \_ cr_restart /home/cesga/sdiaz/
>>> ompi_global_snapshot_12554.ckpt/0/opal_snapshot_0.ckpt/
>>> ompi_blcr_context.12558
>>> sdiaz 12558 0.2 0.0 99464 3616 pts/0 Sl+ 15:58 0:00
>>> | \_ ./pi3
>>>
>>>
>>> Sergio Díaz wrote:
>>>>
>>>> Hi Josh
>>>>
>>>> Here is the file.
>>>>
>>>> I will try the trunk, but I think I broke my Open MPI
>>>> installation doing "something", and I don't know what :-( .
>>>> I was modifying the MCA parameters...
>>>> When I submit a job, the orted daemon spawned on the slave host is
>>>> launched in a loop until it exhausts all the reserved memory.
>>>> It is very strange, so I will compile it again, reproduce the
>>>> bug, and then test the trunk.
>>>>
>>>> Thanks a lot for the support and tickets opened.
>>>> Sergio
>>>>
>>>>
>>>> sdiaz 30279 0.0 0.0 1888 560 ? Ds 12:54
>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/
>>>> cesga/sge62/default/spool/compute
>>>> sdiaz 30286 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -
>>>> mca ess env -mca orte_ess_jobid 219
>>>> sdiaz 30322 0.0 0.0 52772 1188 ? S 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
>>>> sdiaz 30358 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/
>>>> orted
>>>> sdiaz 30394 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/
>>>> bin/orted
>>>> sdiaz 30430 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30466 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30502 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30538 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/cesga/
>>>> openmpi-1.3.3/bin/orted
>>>> sdiaz 30574 0.0 0.0 52772 1188 ? D 12:54
>>>> 0:00 \_ /bin/bash /opt/
>>>> cesga/openmpi-1.3.3/bin/orted
>>>> ....
>>>>
>>>>
>>>>
>>>> Josh Hursey wrote:
>>>>>
>>>>>
>>>>> On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
>>>>>
>>>>>> Hi Josh,
>>>>>>
>>>>>> You were right. The main problem was /tmp. SGE uses a scratch
>>>>>> directory in which jobs keep their temporary files. After
>>>>>> setting TMPDIR to /tmp, checkpointing works!
>>>>>> However, when I try to restart... I get the following error
>>>>>> (see ERROR 1). The -v option adds these lines (see ERROR 2).
>>>>>
>>>>> It is concerning that ompi-restart is segfault'ing when it
>>>>> errors out. The error message is being generated between the
>>>>> launch of the opal-restart starter command and when we try to
>>>>> exec(cr_restart). Usually the failure is related to a corruption
>>>>> of the metadata stored in the checkpoint.
>>>>>
>>>>> Can you send me the file below:
>>>>> ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/
>>>>> snapshot_meta.data
>>>>>
>>>>> I was able to reproduce the segv (at least I think it is the
>>>>> same one). We failed to check the validity of a string when we
>>>>> parse the metadata. I committed a fix to the trunk in r22290,
>>>>> and requested that the fix be moved to the v1.4 and v1.5
>>>>> branches. If you are interested in seeing when they get applied
>>>>> you can follow the following tickets:
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2140
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2141
>>>>>
>>>>> Can you try the trunk to see if the problem goes away? The
>>>>> development trunk and v1.5 series have a bunch of improvements
>>>>> to the C/R functionality that were never brought over to the
>>>>> v1.3/v1.4 series.
>>>>>
>>>>>>
>>>>>> I was trying to use ssh instead of rsh, but it was impossible.
>>>>>> By default it should use ssh and fall back to rsh if there is a
>>>>>> problem, but it always uses rsh, so ssh seems not to work.
>>>>>> If I change this MCA parameter, it still uses rsh.
>>>>>> If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it
>>>>>> tries to use ssh but doesn't work. I get "bash: orted: command
>>>>>> not found" and the MPI process dies.
>>>>>> The command it tries to execute is the following, and I haven't
>>>>>> found the reason why it doesn't find orted, because I set up
>>>>>> /etc/bashrc to always provide the right path, and the path is
>>>>>> right in my application. (See ERROR 4.)
>>>>>
>>>>> This seems like an SGE-specific issue, so a bit out of my
>>>>> domain. Maybe others have suggestions here.
>>>>>
>>>>> -- Josh
>>>>>>
>>>>>>
>>>>>> Many thanks!,
>>>>>> Sergio
>>>>>>
>>>>>> P.S. Sorry about these long emails. I am just trying to show
>>>>>> you information useful for identifying my problems.
>>>>>>
>>>>>>
>>>>>> ERROR 1
>>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart
>>>>>> ompi_global_snapshot_28454.ckpt
>>>>>> >
>>>>>> --------------------------------------------------------------------------
>>>>>> > Error: Unable to obtain the proper restart command to restart
>>>>>> from the
>>>>>> > checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>>>>>> >
>>>>>> >
>>>>>> --------------------------------------------------------------------------
>>>>>> >
>>>>>> --------------------------------------------------------------------------
>>>>>> > Error: Unable to obtain the proper restart command to restart
>>>>>> from the
>>>>>> > checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>>>>> >
>>>>>> >
>>>>>> --------------------------------------------------------------------------
>>>>>> > [compute-3-18:28792] *** Process received signal ***
>>>>>> > [compute-3-18:28792] Signal: Segmentation fault (11)
>>>>>> > [compute-3-18:28792] Signal code: (128)
>>>>>> > [compute-3-18:28792] Failing at address: (nil)
>>>>>> > [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0
>>>>>> [0x33bbf0c430]
>>>>>> > [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free
>>>>>> +0x25) [0x33bb669135]
>>>>>> > [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
>>>>>> > [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
>>>>>> > [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
>>>>>> > [compute-3-18:28792] [ 5] opal-restart [0x40312a]
>>>>>> > [compute-3-18:28792] [ 6] /lib64/tls/libc.so.
>>>>>> 6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>>> > [compute-3-18:28792] [ 7] opal-restart [0x40272a]
>>>>>> > [compute-3-18:28792] *** End of error message ***
>>>>>> > [compute-3-18:28793] *** Process received signal ***
>>>>>> > [compute-3-18:28793] Signal: Segmentation fault (11)
>>>>>> > [compute-3-18:28793] Signal code: (128)
>>>>>> > [compute-3-18:28793] Failing at address: (nil)
>>>>>> > [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0
>>>>>> [0x33bbf0c430]
>>>>>> > [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free
>>>>>> +0x25) [0x33bb669135]
>>>>>> > [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
>>>>>> > [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
>>>>>> > [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/
>>>>>> libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
>>>>>> > [compute-3-18:28793] [ 5] opal-restart [0x40312a]
>>>>>> > [compute-3-18:28793] [ 6] /lib64/tls/libc.so.
>>>>>> 6(__libc_start_main+0xdb) [0x33bb61c3fb]
>>>>>> > [compute-3-18:28793] [ 7] opal-restart [0x40272a]
>>>>>> > [compute-3-18:28793] *** End of error message ***
>>>>>> >
>>>>>> --------------------------------------------------------------------------
>>>>>> > mpirun noticed that process rank 0 with PID 28792 on node
>>>>>> compute-3-18.local exited on signal 11 (Segmentation fault).
>>>>>> >
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> ERROR 2
>>>>>> > [sdiaz_at_compute-3-18 ~]$ ompi-restart -v
>>>>>> ompi_global_snapshot_28454.ckpt
>>>>>> >[compute-3-18.local:28941] Checking for the existence of (/
>>>>>> home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
>>>>>> > [compute-3-18.local:28941] Restarting from file
>>>>>> (ompi_global_snapshot_28454.ckpt)
>>>>>> > [compute-3-18.local:28941] Exec in self
>>>>>> > .......
>>>>>>
>>>>>>
>>>>>> ERROR 3
>>>>>> >[sdiaz_at_compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
>>>>>> > How many plm_rsh_agent instances to invoke
>>>>>> concurrently (must be > 0)
>>>>>> > MCA plm: parameter "plm_rsh_agent" (current value:
>>>>>> "ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
>>>>>>
>>>>>> ERROR 4
>>>>>> >/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -mca
>>>>>> ess env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid 1 -
>>>>>> mca orte_ess_num_procs 2 --hnp-uri >"2152464384.0;tcp://
>>>>>> 192.168.4.143:59176" -mca mca_base_param_file_prefix ft-enable-
>>>>>> cr -mca mca_base_param_file_path >/opt/cesga/openmpi-1.3.3/
>>>>>> share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>>>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/
>>>>>> mpi_test
>>>>>>
>>>>>> Josh Hursey wrote:
>>>>>>>
>>>>>>> On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Josh,
>>>>>>>>
>>>>>>>> The OpenMPI version is 1.3.3.
>>>>>>>>
>>>>>>>> The command ompi-ps doesn't work.
>>>>>>>>
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
>>>>>>>> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs
>>>>>>>> and setting contact info into RML...
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
>>>>>>>> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs
>>>>>>>> and setting contact info into RML...
>>>>>>>>
>>>>>>>> [root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
>>>>>>>> root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
>>>>>>>> 0:00 \_ grep sdiaz
>>>>>>>> sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
>>>>>>>> 0:00 \_ -bash /opt/cesga/sge62/default/spool/
>>>>>>>> compute-3-18/job_scripts/2726959
>>>>>>>> sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
>>>>>>>> sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
>>>>>>>> inherit -nostdin -V compute-3-17.local orted -mca ess env -
>>>>>>>> mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
>>>>>>>> orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://
>>>>>>>> 192.168.4.143:57010" -mca mca_base_param_file_prefix ft-
>>>>>>>> enable-cr -mca mca_base_param_file_path /opt/cesga/
>>>>>>>> openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/
>>>>>>>> cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /
>>>>>>>> home_no_usc/cesga/sdiaz/mpi_test
>>>>>>>> sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
>>>>>>>> 0:00 \_ ./pi3
>>>>>>>>
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n c3-18
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
>>>>>>>> [root_at_compute-3-18 ~]# ompi-ps -n
>>>>>>>>
>>>>>>>> There is no such directory in /tmp on the node. However, if
>>>>>>>> the application is run without SGE, the directory is created
>>>>>>> This may be the core of the problem. ompi-ps and other command
>>>>>>> line tools (e.g., ompi-checkpoint) look for the Open MPI
>>>>>>> session directory in /tmp in order to find the connection
>>>>>>> information to connect to the mpirun process (internally
>>>>>>> called the HNP or Head Node Process).
>>>>>>>
>>>>>>> Can you change the location of the temporary directory in SGE?
>>>>>>> The temporary directory is usually set via an environment
>>>>>>> variable (e.g., TMPDIR, or TMP). So removing the environment
>>>>>>> variable or setting it to /tmp might help.
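In the SGE job script, that could look like the following (a minimal sketch; the mpirun line is taken from this thread, and overriding TMPDIR this way is exactly the assumption being tested):

```shell
#!/bin/bash
# Hypothetical SGE job script: point the Open MPI session directory at
# /tmp so ompi-checkpoint and ompi-ps can find the HNP contact files.
export TMPDIR=/tmp
mpirun -np 2 -am ft-enable-cr ./pi3
```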
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> but if I do ompi-ps -j MPIRUN_PID, it seems to hang and I
>>>>>>>> interrupt it. Does it take a long time?
>>>>>>>>
>>>>>>> It should not take a long time. It is just querying the mpirun
>>>>>>> process for state information.
>>>>>>>
>>>>>>>
>>>>>>>> What does the -j option of the ompi-ps command mean? Is it
>>>>>>>> related to a batch system (like SGE, Condor...)?
>>>>>>>>
>>>>>>> The '-j' option allows the user to specify the Open MPI jobid.
>>>>>>> This is completely different than the jobid provided by the
>>>>>>> batch system. In general, users should not need to specify the
>>>>>>> -j option. It is useful when you have multiple Open MPI jobs,
>>>>>>> and want a summary of just one of them.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks for the ticket. I will follow it.
>>>>>>>>
>>>>>>>> Talking with Alan, I realized that only a few transport
>>>>>>>> protocols are supported, and maybe that is the problem.
>>>>>>>> Currently, SGE uses qrsh to spawn the MPI processes. I can
>>>>>>>> change this and use ssh instead. I'm going to test it this
>>>>>>>> afternoon and will report the results.
>>>>>>>>
>>>>>>> Try 'ssh' and see if that helps. I suspect the problem is with
>>>>>>> the session directory location though.
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sergio
>>>>>>>>
>>>>>>>>
>>>>>>>> Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have managed to checkpoint a simple program without SGE.
>>>>>>>>>> Now I'm trying to do the Open MPI + SGE integration, but I
>>>>>>>>>> have some problems... When I try to checkpoint the mpirun
>>>>>>>>>> PID, I get an error similar to the one I get when the PID
>>>>>>>>>> doesn't exist. See the example below.
>>>>>>>>>>
>>>>>>>>> I do not have any experience with the SGE environment, so I
>>>>>>>>> suspect that there may be something 'special' about the
>>>>>>>>> environment that is tripping up the ompi-checkpoint tool.
>>>>>>>>>
>>>>>>>>> First of all, what version of Open MPI are you using?
>>>>>>>>>
>>>>>>>>> Somethings to check:
>>>>>>>>> - Does 'ompi-ps' work when your application is running?
>>>>>>>>> - Is there a /tmp/openmpi-sessions-* directory on the node
>>>>>>>>> where mpirun is currently running? This directory contains
>>>>>>>>> information on how to connect to the mpirun process from an
>>>>>>>>> external tool, if it's missing then this could be the cause
>>>>>>>>> of the problem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Any ideas?
>>>>>>>>>> Does anybody have a script to do this automatically with
>>>>>>>>>> SGE? For example, I have one that checkpoints every X
>>>>>>>>>> seconds with BLCR for non-MPI jobs. It is launched by SGE
>>>>>>>>>> if you have configured the queue and the checkpoint
>>>>>>>>>> environment.
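A time-triggered wrapper in that same spirit could be sketched for Open MPI like this (an untested sketch; the argument handling, the interval default, and relying on ompi-checkpoint -s are placeholder assumptions, not a working integration):

```shell
#!/bin/bash
# Hypothetical periodic-checkpoint loop: checkpoint the given mpirun
# PID every INTERVAL seconds until the process exits.
PID=$1
INTERVAL=${2:-300}
while kill -0 "$PID" 2>/dev/null; do
    sleep "$INTERVAL"
    ompi-checkpoint -s "$PID"
done
```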
>>>>>>>>>>
>>>>>>>>> I do not know of any integration of the Open MPI
>>>>>>>>> checkpointing work with SGE at the moment.
>>>>>>>>>
>>>>>>>>> As for time-triggered checkpointing, I have a feature
>>>>>>>>> ticket open about this:
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1961
>>>>>>>>>
>>>>>>>>> It is not available yet, but in the works.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Is it possible to choose the name of the checkpoint folder
>>>>>>>>>> when you do the ompi-checkpoint? I can't find an option for it.
>>>>>>>>>>
>>>>>>>>> Not at this time. Though I could see it as a useful feature,
>>>>>>>>> and it shouldn't be too hard to implement. I filed a ticket if
>>>>>>>>> you want to follow the progress:
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2098
>>>>>>>>>
>>>>>>>>> -- Josh
>>>>>>>>>> Regards,
>>>>>>>>>> Sergio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>>>>>>>>>> ....
>>>>>>>>>> root 20044 0.0 0.0 4468 1224 ? S 13:28
>>>>>>>>>> 0:00 \_ sge_shepherd-2645150 -bg
>>>>>>>>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
>>>>>>>>>> 0:00 \_ -bash /opt/cesga/sge62/default/spool/
>>>>>>>>>> compute-3-17/job_scripts/2645150
>>>>>>>>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>>>>>>>>>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>>>>>>>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>>>>>>>>>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
>>>>>>>>>> inherit -nostdin -V compute-3-18..........
>>>>>>>>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>>>>>>>>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>>>>>>>>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>>>>>>>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>>>>>>>>
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> ompi-checkpoint PID_OF_MPIRUN
>>>>>>>>>> Open MPI Checkpoint Tool
>>>>>>>>>>
>>>>>>>>>> -am <arg0> Aggregate MCA parameter set file
>>>>>>>>>> list
>>>>>>>>>> -gmca|--gmca <arg0> <arg1>
>>>>>>>>>> Pass global MCA parameters that
>>>>>>>>>> are applicable to
>>>>>>>>>> all contexts (arg0 is the
>>>>>>>>>> parameter name; arg1 is
>>>>>>>>>> the parameter value)
>>>>>>>>>> -h|--help This help message
>>>>>>>>>> --hnp-jobid <arg0> This should be the jobid of the
>>>>>>>>>> HNP whose
>>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>>> --hnp-pid <arg0> This should be the pid of the
>>>>>>>>>> mpirun whose
>>>>>>>>>> applications you wish to checkpoint.
>>>>>>>>>> -mca|--mca <arg0> <arg1>
>>>>>>>>>> Pass context-specific MCA
>>>>>>>>>> parameters; they are
>>>>>>>>>> considered global if --gmca is not
>>>>>>>>>> used and only
>>>>>>>>>> one context is specified (arg0 is
>>>>>>>>>> the parameter
>>>>>>>>>> name; arg1 is the parameter value)
>>>>>>>>>> -s|--status Display status messages describing
>>>>>>>>>> the progression
>>>>>>>>>> of the checkpoint
>>>>>>>>>> --term Terminate the application after
>>>>>>>>>> checkpoint
>>>>>>>>>> -v|--verbose Be Verbose
>>>>>>>>>> -w|--nowait Do not wait for the application to
>>>>>>>>>> finish
>>>>>>>>>> checkpointing before returning
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> [sdiaz_at_compute-3-17 ~]$ exit
>>>>>>>>>> logout
>>>>>>>>>> Connection to c3-17 closed.
>>>>>>>>>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>>>>>>>>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>>>>>>>>> -bash-3.00$ ps auxf |grep sdiaz
>>>>>>>>>>
>>>>>>>>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
>>>>>>>>>> 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/
>>>>>>>>>> qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/
>>>>>>>>>> active_jobs/2645150.1/1.compute-3-18
>>>>>>>>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>>>>>>>>>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid
>>>>>>>>>> 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --
>>>>>>>>>> hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
>>>>>>>>>> mca_base_param_file_prefix ft-enable-cr -mca
>>>>>>>>>> mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/
>>>>>>>>>> openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -
>>>>>>>>>> mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/
>>>>>>>>>> mpi_test
>>>>>>>>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>>>>>>>>>> 0:00 \_ pi3
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sergio Díaz Montes
>>>>>>>>>> Centro de Supercomputacion de Galicia
>>>>>>>>>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de
>>>>>>>>>> Compostela (Spain)
>>>>>>>>>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
>>>>>>>>>> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users