Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem restarting multiprocess mpi application
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-01-11 16:34:08


On Dec 13, 2009, at 3:57 PM, Kritiraj Sajadah wrote:

> Dear All,
> I am running a simple MPI application, which looks as
> follows:
>
> ######################################
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <signal.h>
>
> int main(int argc, char **argv)
> {
> int rank,size;
>
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> MPI_Comm_size(MPI_COMM_WORLD, &size);
> printf("Hello\n");
> sleep(15);
> printf("Hello again\n" );
> sleep(15);
> printf("Final Hello\n");
> sleep(15);
> printf("bye \n");
> MPI_Finalize();
> return 0;
> }
> #################################
>
> When I run my application as follows, it checkpoints correctly, but
> when I try to restart it, it gives the following errors:
>
> ######################################
>
> ompi-restart ompi_global_snapshot_380.ckpt
> Hello again
> [sun06:00381] *** Process received signal ***
> [sun06:00381] Signal: Bus error (7)
> [sun06:00381] Signal code: (2)
> [sun06:00381] Failing at address: 0xae7cb054
> [sun06:00381] [ 0] [0xb7f8640c]
> [sun06:00381] [ 1] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
> [sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
> [sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
> [sun06:00381] [ 4] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) [0xb7bca69b]
> [sun06:00381] [ 5] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
> [sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
> [sun06:00381] [ 7] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129) [0xb7b96fca]
> [sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]
> [sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
> [sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
> [sun06:00381] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 399 on node sun06 exited
> on signal 7 (Bus error).
> --------------------------------------------------------------------------
> #####################################################################

This could be caused by a variety of things, including a bad BLCR
installation. :/

Are you sure that your application was between MPI_Init() and
MPI_Finalize() when you checkpointed?

> I am running it as follows:
>
> ################################################################
> mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp mpisleepbas.
> ################################################################

Try specifying the MCA parameters in your
$HOME/.openmpi/mca-params.conf file.
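
As a rough sketch, the two -mca options from your command line would
look something like this in that file, one "name = value" pair per
line. The values are copied from your mpirun line. I am assuming that
-am ft-enable-cr stays on the command line, since it selects a
parameter set rather than setting a single parameter:

```
# $HOME/.openmpi/mca-params.conf
# Sketch only: values taken from the mpirun command quoted above.

# Exclude the openib BTL (same as "-mca btl ^openib")
btl = ^openib

# Directory where global snapshots are written
# (same as "-mca snapc_base_global_snapshot_dir /tmp")
snapc_base_global_snapshot_dir = /tmp
```

With the file in place, the two -mca options could be dropped from the
mpirun line, and the other Open MPI commands run by the same user
should pick up the same settings.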

>
> Once a checkpoint is taken, I have to copy it to the home directory
> and try to restart it.

The manual movement of the checkpoint file is not currently supported.
I filed a bug about it if you want to track it:
   https://svn.open-mpi.org/trac/ompi/ticket/2161

>
> Please note that if I use -np 1, it works fine when I restart it.
> The problem occurs mainly when the application has more than one
> process running.

Are the processes on the same machines or different machines?

-- Josh

>
>
> Any help would be much appreciated.
>
>
> Raj