Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] error in checkpointing an mpi application
From: Constantinos Makassikis (cmakassikis_at_[hidden])
Date: 2009-10-01 03:21:36


Hi,

From what you describe below, it seems that Open MPI was not configured
correctly.

You ran

./configure --with-ft=cr --enable-mpi-threads --with-blcr=/usr/local/bin --with-blcr-libdir=/usr/local/lib

whereas, given the installation paths you listed, it should have been
something more like the following (--with-blcr expects the BLCR
installation prefix, not its bin directory):

./configure --with-ft=cr --enable-mpi-threads --with-blcr=/root/MS --with-blcr-libdir=/root/MS/lib
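Before re-running configure, it may help to confirm that the directory you pass really is a BLCR installation prefix. Below is a minimal sketch; the function name check_blcr_prefix is made up for illustration, and it assumes BLCR places cr_checkpoint under bin/ and libcr.h under include/, which may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or Open MPI): check that a directory
# looks like a BLCR installation prefix before handing it to --with-blcr.
check_blcr_prefix() {
    prefix=$1
    # Assumption: a BLCR install provides bin/cr_checkpoint and include/libcr.h.
    if [ -x "$prefix/bin/cr_checkpoint" ] && [ -f "$prefix/include/libcr.h" ]; then
        echo "ok: $prefix"
        return 0
    fi
    echo "not a BLCR prefix: $prefix"
    return 1
}
```

For example, running check_blcr_prefix /root/MS and then check_blcr_prefix /usr/local should show which of the two prefixes actually holds your BLCR installation.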

Apart from that, if you want the BLCR modules loaded when your machine
boots, a simple way is to add the following lines to rc.local.
This file lives somewhere under /etc; the exact location varies from one
Linux distribution to another (e.g. /etc/rc.d/rc.local or /etc/rc.local).

/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko
/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko
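
After a reboot you can verify that both modules were actually loaded. The following is only a sketch: the function name blcr_modules_loaded is hypothetical, and it assumes the modules show up in /proc/modules under the names blcr_imports and blcr (taken from the .ko file names above).

```shell
#!/bin/sh
# Hypothetical helper: report whether both BLCR kernel modules appear in a
# module list. The file argument defaults to /proc/modules.
blcr_modules_loaded() {
    modules_file=${1:-/proc/modules}
    for mod in blcr_imports blcr; do
        # Each line of /proc/modules starts with the module name and a space,
        # so "^blcr " does not accidentally match "blcr_imports".
        if ! grep -q "^$mod " "$modules_file"; then
            echo "missing: $mod"
            return 1
        fi
    done
    echo "BLCR modules loaded"
}
```

Calling blcr_modules_loaded with no argument checks the running kernel; a "missing" line suggests that the insmod commands in rc.local did not run or failed.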

Just in case: if you have multiple MPI installations, you can check which
one you are using with the following command:

which mpirun
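
If you want to script that check, here is a small sketch that compares the mpirun found on PATH against the prefix you expect. The function name mpirun_under_prefix is made up for illustration, and the prefix you pass is whatever you gave to Open MPI's configure.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the first mpirun on PATH lives under
# the given installation prefix.
mpirun_under_prefix() {
    prefix=$1
    found=$(command -v mpirun) || { echo "mpirun not found on PATH"; return 1; }
    case $found in
        "$prefix"/*) echo "using $found"; return 0 ;;
        *) echo "mpirun resolves to $found, not under $prefix"; return 1 ;;
    esac
}
```

For instance, mpirun_under_prefix /root/MS would fail if PATH picks up a different installation first. Assuming the build included checkpoint/restart support, the output of ompi_info from that same installation should also mention a crs component.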

HTH,

--
Constantinos

Mallikarjuna Shastry wrote:
>  dear sir
>
>
> i am sending the details as follows
>
>
> 1. i am using openmpi-1.3.3 and blcr 0.8.2 
> 2. i have installed blcr 0.8.2 first under /root/MS
> 3. then i installed openmpi 1.3.3 under /root/MS
> 4. i have configured and installed open mpi as follows
>
> #./configure --with-ft=cr --enable-mpi-threads --with-blcr=/usr/local/bin --with-blcr-libdir=/usr/local/lib
> # make 
> # make install
>
> then i added the following to the .bash_profile under home directory( i went to home directory by doing cd ~)
>
> /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko 
> /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko 
> PATH=$PATH:/usr/local/bin
> MANPATH=$MANPATH:/usr/local/man
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
>
> then i compiled and run the file arr_add.c as follows
>
> [root_at_localhost examples]# mpicc -o res arr_add.c
> [root_at_localhost examples]# mpirun -np 2 -am ft-enable-cr ./res
>
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
> --------------------------------------------------------------------------
> Error: The process with PID 5790 is not checkpointable.
>        This could be due to one of the following:
>         - An application with this PID doesn't currently exist
>         - The application with this PID isn't checkpointable
>         - The application with this PID isn't an OPAL application.
>        We were looking for the named files:
>          /tmp/opal_cr_prog_write.5790
>          /tmp/opal_cr_prog_read.5790
> --------------------------------------------------------------------------
> [localhost.localdomain:05788] local) Error: Unable to initiate the handshake with peer [[7788,1],1]. -1
> [localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
> [localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
> 2       2       2       2       2       2       2       2       2       2
>
>
> NOTE: the PID of mpirun is 5788
>
> i gave the following command for taking the checkpoint
>
> [root_at_localhost examples]#ompi-checkpoint -s 5788
>
> i got the following output , but it was hanging like this
>
> [localhost.localdomain:05796]                 Requested - Global Snapshot Reference: (null)
> [localhost.localdomain:05796]                   Pending - Global Snapshot Reference: (null)
> [localhost.localdomain:05796]                   Running - Global Snapshot Reference: (null)
>
>
> can anybody resolve this problem
> kindly rectify it.
>
>
> with regards
>
> mallikarjuna shastry