Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI with BLCR runtime problem
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-08-24 10:49:05


On Aug 24, 2010, at 10:27 AM, 陈文浩 wrote:

> Dear OMPI users,
>
> I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 - blade10, nfs)
> BLCR configure script: ./configure --prefix=/opt/blcr --enable-static
> After the installation, I can see the 'blcr' module loaded correctly (lsmod | grep blcr). And I can also run 'cr_run', 'cr_checkpoint', 'cr_restart' to C/R the examples correctly under /blcr/examples/.
> Then, OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static
> The installation is okay too.
>
> Then here comes the problem.
> On one node:
> mpirun -np 2 ./hello_c.c
> mpirun -np 2 -am ft-enable-cr ./hello_c.c
> are both okay.
> On two nodes (blade01, blade02):
> mpirun -np 2 -machinefile mf ./hello_c.c OK.
> mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c ERROR. Listed below:
>
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [blade02:28896] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> opal_cr_init() failed failed
> --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 77
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> I have no idea about the error. Our blades use nfs, does it matter? Can anyone help me solve the problem? I really appreciate it. Thank you.
>
> btw, similar error like:
> "Oops, cr_init() failed (the initialization call to the BLCR checkpointing system). Abort in despair.
> The crmpi SSI subsystem failed to initialized modules successfully during MPI_INIT. This is a fatal error; I must abort." occurs when I use LAM/MPI + BLCR.

This seems to indicate that BLCR is not working correctly on one of the compute nodes. Did you try some of the BLCR example programs on both of the compute nodes? If BLCR's cr_init() fails, then there is not much the MPI library can do for you.

I would check the installation of BLCR on all of the compute nodes (blade01 and blade02). Make sure the modules are loaded and that the BLCR single process examples work on all nodes. I suspect that one of the nodes is having trouble initializing the BLCR library.
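
As a rough sketch of the per-node module check described above: it assumes passwordless ssh to each node and that lsmod is on the remote PATH. The function name and the way node names are passed in are illustrative only and are not part of BLCR or Open MPI.

```shell
#!/bin/sh
# Sketch: report whether the 'blcr' kernel module is loaded on each node.
# Assumption: passwordless ssh works; node names are given as arguments.

# Reads lsmod output on stdin. Succeeds only if a module named exactly
# 'blcr' appears in the first column (the module-name column of lsmod).
has_blcr_module() {
    awk '$1 == "blcr" { found = 1 } END { exit !found }'
}

for node in "$@"; do
    if ssh -n "$node" lsmod | has_blcr_module; then
        echo "$node: blcr module loaded"
    else
        echo "$node: blcr module NOT loaded (or node unreachable)"
    fi
done
```

It could be run as, for example, "sh check_blcr.sh blade01 blade02" (the script name is hypothetical). A loaded module is necessary but not sufficient, so the BLCR single-process examples should still be run on every node.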

You may also want to make sure prelinking is turned off on all nodes:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
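
A minimal sketch for spotting enabled prelinking on a node, to be run on each node in turn. It assumes a Red Hat-style system where the setting is kept as PRELINKING=yes or PRELINKING=no in /etc/sysconfig/prelink; that path and the function name are assumptions, and the FAQ entry above remains the authoritative procedure.

```shell
#!/bin/sh
# Sketch: detect whether prelinking is enabled in a Red Hat-style config.
# Assumption: the setting lives in /etc/sysconfig/prelink as PRELINKING=yes|no.

# Succeeds if the given config file (default: /etc/sysconfig/prelink)
# contains an uncommented PRELINKING=yes line.
prelink_enabled() {
    conf=${1:-/etc/sysconfig/prelink}
    [ -r "$conf" ] || return 1   # no readable config: treat as not enabled
    grep -Eq '^[[:space:]]*PRELINKING="?yes' "$conf"
}

if prelink_enabled; then
    echo "prelinking appears to be enabled on $(uname -n)"
else
    echo "prelinking not enabled (or no config found) on $(uname -n)"
fi
```

On distributions that configure prelink elsewhere, a different path can be passed as the first argument to prelink_enabled.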

If that doesn't work, then I would suggest trying the current Open MPI trunk. There should not be any problem with using NFS: this failure occurs in MPI_Init, which is well before we ever try to use the file system. I also test with NFS and with local staging on a fairly regular basis, so it shouldn't be a problem even when checkpointing/restarting.

-- Josh

>
> Regards
>
> whchen

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey