
Open MPI User's Mailing List Archives


Subject: [OMPI users] OpenMPI with BLCR runtime problem
From: 陈文浩 (whchen_at_[hidden])
Date: 2010-08-24 10:27:06


Dear OMPI users,

 

I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2 (nodes blade01 -
blade10, NFS).

BLCR configure script: ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see the 'blcr' module loaded correctly
(lsmod | grep blcr). I can also run 'cr_run', 'cr_checkpoint', and
'cr_restart' to C/R the examples correctly under /blcr/examples/.
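The module check above can be repeated on every node, not only the one used for the install. This is a minimal sketch, not a diagnosis of the error. It assumes passwordless ssh to each node and a machinefile with one hostname per line (optionally followed by other fields). The function names has_blcr and check_nodes are hypothetical. It reads /proc/modules, which is the file lsmod formats, and matches the module name exactly so that a line for blcr_imports alone does not count.

```shell
#!/bin/sh
# Succeed if the lsmod-style text on stdin lists a module named exactly "blcr".
has_blcr() {
    awk '$1 == "blcr" { found = 1 } END { exit !found }'
}

# Report the blcr module state for each host named in the machinefile "$1".
# Assumption: one hostname per line; blank lines and '#' comments are skipped.
check_nodes() {
    while read -r host _; do
        case "$host" in ''|'#'*) continue ;; esac
        # ssh -n keeps ssh from consuming the machinefile on stdin.
        if ssh -n "$host" cat /proc/modules | has_blcr; then
            echo "$host: blcr loaded"
        else
            echo "$host: blcr NOT loaded"
        fi
    done < "$1"
}
```

With the machinefile from this post, the call would be: check_nodes mf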

Then, the OMPI configure script is: ./configure --prefix=/opt/ompi
--with-ft=cr --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads
--enable-static

The installation is okay too.

 

Then here comes the problem.

On one node:

         mpirun -np 2 ./hello_c.c

         mpirun -np 2 -am ft-enable-cr ./hello_c.c

         are both okay.

On two nodes (blade01, blade02):

         mpirun -np 2 -machinefile mf ./hello_c.c

         is okay.

         mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c

         fails with the error listed below:

 

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[blade02:28896] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 77
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------

 

I have no idea what causes this error. Our blades use NFS; does that matter?
Can anyone help me solve the problem? I would really appreciate it. Thank you.

 

By the way, a similar error occurs when I use LAM/MPI + BLCR:

"Oops, cr_init() failed (the initialization call to the BLCR checkpointing
system). Abort in despair.

The crmpi SSI subsystem failed to initialized modules successfully during
MPI_INIT. This is a fatal error; I must abort."

 

Regards

 

whchen