Dear OMPI users,

 

I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 - blade10, NFS)

BLCR configure script: ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see that the "blcr" module is loaded correctly (lsmod | grep blcr). I can also run "cr_run", "cr_checkpoint", and "cr_restart" to checkpoint/restart the examples under /blcr/examples/ correctly.
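The same module check can be repeated on every blade rather than only on the node where I built BLCR. This is a minimal sketch, assuming passwordless ssh between the blades. The function names `has_blcr_module` and `check_nodes` are made up for illustration and are not part of BLCR or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: reads lsmod-style output on stdin and succeeds
# if any listed module name starts with "blcr" (e.g. blcr, blcr_imports).
has_blcr_module() {
    awk '$1 ~ /^blcr/ { found = 1 } END { exit (found ? 0 : 1) }'
}

# Hypothetical helper: runs lsmod on each host given as an argument
# (assumes passwordless ssh) and reports whether the module is loaded there.
check_nodes() {
    for host in "$@"; do
        if ssh -n "$host" lsmod 2>/dev/null | has_blcr_module; then
            echo "$host: blcr loaded"
        else
            echo "$host: blcr NOT loaded (or host unreachable)"
        fi
    done
}
```

For example, `check_nodes blade01 blade02` would print one line per node.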

Then, the OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static

The installation is okay too.

 

Then here comes the problem.

On one node:

         mpirun -np 2 ./hello_c.c

         mpirun -np 2 -am ft-enable-cr ./hello_c.c

         are both okay.

On two nodes (blade01, blade02):

         mpirun -np 2 -machinefile mf ./hello_c.c  OK.

         mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c  ERROR. Listed below:

 

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[blade02:28896] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 77
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------

 

I have no idea what causes this error. Our blades use NFS; does that matter? Can anyone help me solve the problem? I would really appreciate it. Thank you.

 

By the way, a similar error occurs when I use LAM/MPI + BLCR:

"Oops, cr_init() failed (the initialization call to the BLCR checkpointing system). Abort in despair.

The crmpi SSI subsystem failed to initialized modules successfully during MPI_INIT. This is a fatal error; I must abort."

 

Regards

 

whchen