Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] error in checkpointing in open mpi
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2009-09-25 12:28:26


On Sep 25, 2009, at 7:10 AM, Mallikarjuna Shastry wrote:

> dear sir
>
> i am sending the details as follows
>
>
> 1. i am using openmpi-1.3.3 and blcr 0.8.2
> 2. i have installed blcr 0.8.2 first under /root/MS
> 3. then i installed openmpi 1.3.3 under /root/MS
> 4 i have configured and installed open mpi as follows
>
> #./configure --with-ft=cr --enable-mpi-threads
> --with-blcr=/usr/local/bin
> --with-blcr-libdir=/usr/local/lib

If you want to enable the C/R thread, you need to request it explicitly.
Try adding '--enable-ft-thread' to your Open MPI configure line in
addition to '--enable-mpi-threads'. The C/R thread should help with the
problem you describe below.

Also, it looks like you are specifying the wrong BLCR paths. Above you
said that BLCR was installed in '/root/MS', but you are passing
'/usr/local/bin' and '/usr/local/lib' to configure.
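
To make the two suggestions above concrete, here is a rough sketch of
what the adjusted configure line might look like. It assumes BLCR really
was installed with prefix /root/MS and that its libraries ended up in
/root/MS/lib; the variable names are only for illustration. The snippet
prints the command instead of running it, so you can check it first.

```shell
# Assumption: BLCR was installed with prefix /root/MS, as the original message states.
BLCR_PREFIX=/root/MS

# Flags from the original command, plus --enable-ft-thread and adjusted BLCR paths.
CONFIGURE_ARGS="--with-ft=cr --enable-mpi-threads --enable-ft-thread \
--with-blcr=$BLCR_PREFIX --with-blcr-libdir=$BLCR_PREFIX/lib"

# Print the command for review; run it from the Open MPI source tree.
echo "./configure $CONFIGURE_ARGS"
```

If your BLCR libraries live somewhere else (lib64, for example), change
the libdir accordingly.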

Have you confirmed that you can successfully checkpoint/restart a
non-MPI program on this system with BLCR?
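
A quick way to start on that check is sketched below. It only verifies
that the BLCR command-line tools are reachable and then prints the
sequence to try by hand. The program name ./counter is a hypothetical
stand-in for any long-running, non-MPI executable, and context.<pid> is
the file name I would expect cr_checkpoint to write by default.

```shell
# Check that the BLCR command-line tools are reachable before involving MPI.
missing=""
for tool in cr_run cr_checkpoint cr_restart; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done

if [ -n "$missing" ]; then
    echo "BLCR tools not found on PATH:$missing"
else
    # ./counter is a hypothetical test program; substitute your own.
    echo "Try: cr_run ./counter &"
    echo "Then: cr_checkpoint <pid>"
    echo "Then: cr_restart context.<pid>"
fi
```

If the restart works for a plain program, the BLCR side is probably
fine and the problem is more likely in the Open MPI build or the
application.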

> # make
> # make install
>
> then i added the following to the .bash_profile under home
> directory( i went to home directory by doing cd ~)
>
> /sbin/insmod
> /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko
> /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko

Instead of putting these in your .bash_profile, the /sbin/insmod
commands should probably be set up to run automatically at boot time.
BLCR's Admin Guide discusses how you can set this up (see section 2.5):
   https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Admin_Guide.html
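
Whichever way the modules get loaded, it is worth confirming that they
are actually present before trying a checkpoint. A small sketch,
assuming a Linux system that exposes /proc/modules (the function name is
just for illustration):

```shell
# Print the load status of each BLCR kernel module named in the original message.
blcr_module_status() {
    for mod in blcr_imports blcr; do
        # /proc/modules lists one loaded module per line, name first.
        if grep -q "^$mod " /proc/modules 2>/dev/null; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
}

blcr_module_status
```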

> PATH=$PATH:/usr/local/bin
> MANPATH=$MANPATH:/usr/local/man
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Again, if you installed Open MPI and BLCR in /root/MS, then you need to
add that installation path to your environment (e.g., PATH,
LD_LIBRARY_PATH, MANPATH) instead of /usr/local.
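
As a sketch, the environment lines could look something like this. It
assumes both packages were installed with prefix /root/MS and use the
usual bin, lib, and share/man subdirectories under it; check where your
install actually put things.

```shell
# Assumption: Open MPI and BLCR were both installed with this prefix.
PREFIX=/root/MS

# Put the install tree first so its binaries and libraries win over /usr/local.
export PATH="$PREFIX/bin:$PATH"
# The ${VAR:+...} form avoids a dangling ':' when the variable was empty.
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export MANPATH="$PREFIX/share/man${MANPATH:+:$MANPATH}"
```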

>
> then i compiled and run the file arr_add.c as follows
>
> [root_at_localhost examples]# mpicc -o res arr_add.c
> [root_at_localhost examples]# mpirun -np 2 -am ft-enable-cr
> ./res

You really should never run Open MPI as root. Neither Open MPI nor
BLCR requires that you be root to use it.

>
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> --------------------------------------------------------------------------
> Error: The process with PID 5790 is not checkpointable.
>        This could be due to one of the following:
>         - An application with this PID doesn't currently exist
>         - The application with this PID isn't checkpointable
>         - The application with this PID isn't an OPAL application.
>        We were looking for the named files:
>          /tmp/opal_cr_prog_write.5790
>          /tmp/opal_cr_prog_read.5790
> --------------------------------------------------------------------------
> [localhost.localdomain:05788] local) Error: Unable to initiate the handshake with peer [[7788,1],1]. -1
> [localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
> [localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
>
>
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2 2
>

I suspect that this is related to your application. Have you tried to
checkpoint/restart a simple example program, something with a core loop
like the one below? Note that the MPI_Barrier is necessary if you are
not using the C/R thread, since the application has to call into the
Open MPI library for a pending checkpoint request to be noticed:
---------
for(i = 0; i < 100; i++) {
   MPI_Barrier(MPI_COMM_WORLD);
   printf("Counting %d\n", i);
   sleep(1);
}
----------

Per my other message to you on the list:
   http://www.open-mpi.org/community/lists/users/2009/09/10741.php

--------------------
Is your application using SIGUSR1?

This error message indicates that Open MPI's daemons could not
communicate with the application processes. The daemons send SIGUSR1
to the process to initiate the handshake (you can change this signal
with -mca opal_cr_signal). If your application does not respond to the
daemon within a time bound (default 20 sec, though you can change it
with -mca snapc_full_max_wait_time) then this error is printed, and
the checkpoint is aborted.
--------------------
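
For reference, passing those two parameters on the mpirun line might
look roughly like the sketch below. The values are only examples: I am
assuming opal_cr_signal takes a signal number (12 is SIGUSR2 on Linux
x86) and that the wait time is given in seconds. The snippet prints the
command so you can adjust it before running.

```shell
# Assumptions: 12 is SIGUSR2 on Linux x86, and opal_cr_signal takes a signal number.
CR_SIGNAL=12
# Handshake timeout in seconds (the default mentioned above is 20).
CR_MAX_WAIT=60

# Same launch line as in the original message, with the two MCA parameters added.
MPIRUN_CMD="mpirun -np 2 -am ft-enable-cr \
-mca opal_cr_signal $CR_SIGNAL \
-mca snapc_full_max_wait_time $CR_MAX_WAIT ./res"

echo "$MPIRUN_CMD"
```

Changing the signal only helps if the application itself uses SIGUSR1;
raising the wait time only helps if the application is slow to reach a
point where it calls into the MPI library.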

-- Josh

>
>
>
> NOTE: the PID of mpirun is 5788
>
> i geve the following command for taking the checkpoint
>
> [root_at_localhost examples]#ompi-checkpoint -s 5788
>
> i got the following output , but it was hanging like this
>
> [localhost.localdomain:05796] Requested - Global Snapshot Reference: (null)
> [localhost.localdomain:05796] Pending - Global Snapshot Reference: (null)
> [localhost.localdomain:05796] Running - Global Snapshot Reference: (null)
>
>
>
> kindly rectify it.
>
> with regards
>
> mallikarjuna shastry