Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Sergio Díaz (sdiaz_at_[hidden])
Date: 2009-11-09 08:33:59


Hi Josh,

The Open MPI version is 1.3.3.

The command ompi-ps doesn't work.

[root_at_compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
[root_at_compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
[compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and setting
contact info into RML...
[root_at_compute-3-18 ~]# ompi-ps -v -j 2726959
[compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and setting
contact info into RML...

[root_at_compute-3-18 ~]# ps uaxf | grep sdiaz
root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38 0:00
\_ grep sdiaz
sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37 0:00 \_
-bash /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37 0:00
\_ mpirun -np 2 -am ft-enable-cr ./pi3
sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit
-nostdin -V compute-3-17.local orted -mca ess env -mca orte_ess_jobid
2769879040 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2769879040.0;tcp://192.168.4.143:57010" -mca mca_base_param_file_prefix
ft-enable-cr -mca mca_base_param_file_path
/opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
-mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
0:00 \_ ./pi3

[root_at_compute-3-18 ~]# ompi-ps -n c3-18
[root_at_compute-3-18 ~]# ompi-ps -n compute-3-18
[root_at_compute-3-18 ~]# ompi-ps -n

There is no such directory in /tmp on the node. However, if the
application is run without SGE, the directory is created; but when I
then run ompi-ps -j MPIRUN_PID, it seems to hang and I have to
interrupt it. Does it take a long time?
What does the -j option of the ompi-ps command mean? It isn't related
to a batch system (like SGE, Condor...), is it?
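
(One guess on my side: maybe -j expects the Open MPI jobid rather than
the SGE job number. In the ps output above, orted was started with
orte_ess_jobid 2769879040, so perhaps the right invocation would be
something like:

    ompi-ps -v -j 2769879040

but I have not verified this.)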

Thanks for the ticket. I will follow it.
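
In the meantime, I will probably adapt my BLCR script to call
ompi-checkpoint periodically. An untested sketch of the idea (the
script name, PID argument and interval are just placeholders):

    #!/bin/bash
    # ckpt_loop.sh -- untested sketch: checkpoint a running mpirun
    # every INTERVAL seconds until it exits.
    # Usage: ckpt_loop.sh <mpirun_pid> [interval_seconds]
    MPIRUN_PID=$1
    INTERVAL=${2:-600}          # default: every 10 minutes
    while kill -0 "$MPIRUN_PID" 2>/dev/null; do
        sleep "$INTERVAL"
        ompi-checkpoint -s "$MPIRUN_PID"   # -s prints progress messages
    done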

Talking with Alan, I realized that only a few transport protocols are
supported, and maybe that is the problem. Currently, SGE uses qrsh to
spawn the MPI processes. I can change this and use ssh instead. So I'm
going to test it this afternoon and I will report the results to you.
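
This is roughly what I plan to try; I have not yet confirmed how the
plm_rsh_agent parameter behaves under SGE in 1.3.3, so I will check the
available rsh launcher parameters with ompi_info first:

    ompi_info --param plm rsh
    mpirun -np 2 -am ft-enable-cr -mca plm_rsh_agent ssh ./pi3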

Regards,
Sergio

Josh Hursey wrote:
>
> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
>
>> Hello,
>>
>> I have managed to checkpoint a simple program without SGE. Now,
>> I'm trying to integrate Open MPI with SGE, but I have some
>> problems... When I try to checkpoint the mpirun PID, I get an
>> error similar to the one produced when the PID doesn't exist. See the
>> example below.
>
> I do not have any experience with the SGE environment, so I suspect
> that there may be something 'special' about the environment that is
> tripping up the ompi-checkpoint tool.
>
> First of all, what version of Open MPI are you using?
>
> Some things to check:
> - Does 'ompi-ps' work when your application is running?
> - Is there a /tmp/openmpi-sessions-* directory on the node where
> mpirun is currently running? This directory contains information on
> how to connect to the mpirun process from an external tool; if it is
> missing, then this could be the cause of the problem.
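>
> For example, a quick sanity check on the node where mpirun runs (the
> exact directory name varies):
>
>   ls -d /tmp/openmpi-sessions-*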
>
>>
>> Any ideas?
>> Does somebody have a script to do it automatically with SGE? For
>> example, I have one that checkpoints every X seconds with BLCR for
>> non-MPI jobs. It is launched by SGE if you have configured the queue
>> and the checkpointing environment.
>
> I do not know of any integration of the Open MPI checkpointing work
> with SGE at the moment.
>
> As far as time-triggered checkpointing, I have a feature ticket open
> about this:
> https://svn.open-mpi.org/trac/ompi/ticket/1961
>
> It is not available yet, but it is in the works.
>
>
>>
>> Is it possible to choose the name of the checkpoint folder when you
>> run ompi-checkpoint? I can't find an option for it.
>
> Not at this time. Though I could see it being a useful feature, and
> it shouldn't be too hard to implement. I filed a ticket, if you want
> to follow the progress:
> https://svn.open-mpi.org/trac/ompi/ticket/2098
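>
> (In the meantime, if I remember correctly, you can at least control
> the directory under which the global snapshot is written with the
> snapc_base_global_snapshot_dir MCA parameter, e.g.:
>
>   mpirun -am ft-enable-cr -mca snapc_base_global_snapshot_dir /some/dir ...
>
> though that changes the location, not the snapshot's name.)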
>
> -- Josh
>
>>
>>
>> Regards,
>> Sergio
>>
>>
>> --------------------------------
>>
>> [sdiaz_at_compute-3-17 ~]$ ps auxf
>> ....
>> root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00 \_
>> sge_shepherd-2645150 -bg
>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00
>> \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit
>> -nostdin -V compute-3-18..........
>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>> 0:00 \_ pi3
>>
>>
>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>
>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>
>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>
>> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
>> --------------------------------------------------------------------------
>>
>> ompi-checkpoint PID_OF_MPIRUN
>> Open MPI Checkpoint Tool
>>
>> -am <arg0> Aggregate MCA parameter set file list
>> -gmca|--gmca <arg0> <arg1>
>> Pass global MCA parameters that are
>> applicable to
>> all contexts (arg0 is the parameter name;
>> arg1 is
>> the parameter value)
>> -h|--help This help message
>> --hnp-jobid <arg0> This should be the jobid of the HNP whose
>> applications you wish to checkpoint.
>> --hnp-pid <arg0> This should be the pid of the mpirun whose
>> applications you wish to checkpoint.
>> -mca|--mca <arg0> <arg1>
>> Pass context-specific MCA parameters; they are
>> considered global if --gmca is not used and
>> only
>> one context is specified (arg0 is the parameter
>> name; arg1 is the parameter value)
>> -s|--status Display status messages describing the
>> progression
>> of the checkpoint
>> --term Terminate the application after checkpoint
>> -v|--verbose Be Verbose
>> -w|--nowait Do not wait for the application to finish
>> checkpointing before returning
>>
>> --------------------------------------------------------------------------
>>
>> [sdiaz_at_compute-3-17 ~]$ exit
>> logout
>> Connection to c3-17 closed.
>> [sdiaz_at_svgd mpi_test]$ ssh c3-18
>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>> -bash-3.00$ ps auxf |grep sdiaz
>>
>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28 0:00
>> \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter
>> /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>>
>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328
>> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 2295267328.0;tcp://192.168.4.144:36596 -mca
>> mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path
>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>> -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>> 0:00 \_ pi3
>>
>>
>>
>>
>>
>> --
>> Sergio Díaz Montes
>> Centro de Supercomputacion de Galicia
>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
>> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>> ------------------------------------------------
>
>
>

-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_[hidden] ; http://www.cesga.es/
------------------------------------------------


