Dear OMPI users,
I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2 (nodes blade01 to
blade10, sharing NFS).
BLCR configure script: ./configure --prefix=/opt/blcr --enable-static
After the installation, I can see that the 'blcr' module is loaded correctly
(lsmod | grep blcr). I can also run 'cr_run', 'cr_checkpoint', and
'cr_restart' to checkpoint/restart the examples under /blcr/examples/ correctly.
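For reference, the same `lsmod | grep blcr` check can be repeated on every host named in the machinefile. This is only a sketch under stated assumptions: passwordless ssh between the blades, a machinefile with one host per line (optionally followed by `slots=N`), and the helper names `list_hosts`, `check_blcr_on_hosts`, and the `REMOTE_SH` override are hypothetical, not part of Open MPI or BLCR.

```shell
#!/bin/sh
# Hypothetical helper: print the unique host names found in a machinefile.
# Strips "#" comments and blank lines, keeps only the first field per line.
list_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }' | sort -u
}

# Hypothetical helper: run the module check on each host and report the result.
# REMOTE_SH defaults to ssh (assumed to work without a password prompt).
check_blcr_on_hosts() {
    for host in $(list_hosts "$1"); do
        if ${REMOTE_SH:-ssh} "$host" 'lsmod | grep -q blcr'; then
            echo "$host: blcr loaded"
        else
            echo "$host: blcr NOT loaded"
        fi
    done
}
```

Usage would be `check_blcr_on_hosts mf`, run from the node where mpirun is started.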
Then, the OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads
--enable-static
The installation is okay too.
Then here comes the problem.
On one node:
mpirun -np 2 ./hello_c.c
mpirun -np 2 -am ft-enable-cr ./hello_c.c
are both okay.
On two nodes (blade01, blade02):
mpirun -np 2 -machinefile mf ./hello_c.c
works, but
mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c
fails with the error listed below:
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[blade02:28896] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_cr_init() failed failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 77
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: orte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
I have no idea what causes this error. Our blades use NFS; does that matter?
Can anyone help me solve the problem? I would really appreciate it. Thank you.
By the way, a similar error occurs when I use LAM/MPI + BLCR:
"Oops, cr_init() failed (the initialization call to the BLCR checkpointing
system). Abort in despair.
The crmpi SSI subsystem failed to initialized modules successfully during
MPI_INIT. This is a fatal error; I must abort."
Regards
whchen