
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI with BLCR runtime problem
From: 陈文浩 (whchen_at_[hidden])
Date: 2010-08-25 02:03:30


I was careless. The BLCR Admin Guide says to load the kernel modules, as root, in this order:
    # /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
    # /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko
In my last email I had loaded the modules in the wrong order. After loading them in the order above, it succeeded. lol
Thank you very much for your advice, Josh. Many thanks.
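For anyone hitting the same problem, the load-order steps above can be scripted. This is a minimal sketch, assuming the modules live under /usr/local/lib/blcr/$(uname -r)/ as in the Admin Guide's example; adjust MODDIR to match your own --prefix:

```shell
#!/bin/sh
# Load the BLCR kernel modules in the documented order:
# blcr_imports.ko first, then blcr.ko (must be run as root).
KVER=$(uname -r)
MODDIR=/usr/local/lib/blcr/$KVER   # adjust to your BLCR install prefix

if ! lsmod | grep -q '^blcr '; then
    /sbin/insmod "$MODDIR/blcr_imports.ko"
    /sbin/insmod "$MODDIR/blcr.ko"
fi
lsmod | grep blcr   # verify both modules are now listed
```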

Thank you for your advice, Josh. As you say, 'lsmod | grep blcr' on blade02 shows nothing, which means no blcr module is loaded on blade02. I think that is the main reason I can't C/R MPI programs across these two nodes.
But here is the problem:
I installed BLCR under /opt/blcr on blade01. Our blades use NFS; the /opt/ and /home/ directories are shared, so commands like 'cr_run' and 'cr_restart' can be found on blade02 as well. But I can't insert the blcr module on blade02. It shows:
insmod: error inserting '/opt/blcr/lib/blcr/2.6.16.60-0.21-smp/blcr.ko': -1 Unknown symbol in module
Does this mean I have to install BLCR on blade02? If so, where should I install it? Just over /opt/blcr, or somewhere else?
Please give me some advice. Thank you.
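An "Unknown symbol in module" error from insmod usually means either the modules were loaded out of order (blcr.ko before blcr_imports.ko) or the modules were built for a different kernel than the one running on that node. A quick sketch for checking the latter across both blades, assuming passwordless ssh to the host names used in this thread:

```shell
#!/bin/sh
# Check that each node's running kernel matches a kernel version the
# shared BLCR install actually has modules for; a mismatch (or a wrong
# load order) shows up as "Unknown symbol in module" from insmod.
BLCR_LIB=/opt/blcr/lib/blcr    # shared over NFS in this setup

for host in blade01 blade02; do
    kver=$(ssh "$host" uname -r)
    if [ -d "$BLCR_LIB/$kver" ]; then
        echo "$host: modules built for kernel $kver found"
    else
        echo "$host: no modules for kernel $kver (rebuild BLCR against this kernel)"
    fi
done
```

If the kernels differ, the modules must be rebuilt against each node's own kernel headers; the user-space tools under the NFS-shared prefix can still be shared.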

On Aug 24, 2010, at 10:27 AM, 陈文浩 wrote:

> Dear OMPI users,
>
> I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 --
> blade10, NFS) BLCR configure script: ./configure --prefix=/opt/blcr
> --enable-static After the installation, I can see the 'blcr' module
> loaded correctly (lsmod | grep blcr). And I can also run 'cr_run',
> 'cr_checkpoint', 'cr_restart' to C/R the examples correctly under
> /blcr/examples/.
> Then, the OMPI configure script is: ./configure --prefix=/opt/ompi
> --with-ft=cr --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads
> --enable-static The installation is okay too.
>
> Then here comes the problem.
> On one node:
> mpirun -np 2 ./hello_c
> mpirun -np 2 -am ft-enable-cr ./hello_c
> are both okay.
> On two nodes (blade01, blade02):
> mpirun -np 2 -machinefile mf ./hello_c OK.
> mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c ERROR. Listed
> below:
>
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [blade02:28896] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel
> process is likely to abort. There are many reasons that a parallel
> process can fail during opal_init; some of which are due to
> configuration or environment problems. This failure appears to be an
> internal failure; here's some additional information (which may only
> be relevant to an Open MPI developer):
>
> opal_cr_init() failed failed
> --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 77
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel
> process is likely to abort. There are many reasons that a parallel
> process can fail during MPI_INIT; some of which are due to
> configuration or environment problems. This failure appears to be an
> internal failure; here's some additional information (which may only
> be relevant to an Open MPI developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> I have no idea about the error. Our blades use NFS; does it matter? Can
> anyone help me solve this problem? I really appreciate it. Thank you.
>
> btw, a similar error occurs when I use LAM/MPI + BLCR:
> "Oops, cr_init() failed (the initialization call to the BLCR
> checkpointing system). Abort in despair.
> The crmpi SSI subsystem failed to initialize modules successfully during
> MPI_INIT. This is a fatal error; I must abort."

This seems to indicate that BLCR is not working correctly on one of the
compute nodes. Did you try some of the BLCR example programs on both of the
compute nodes? If BLCR's cr_init() fails, then there is not much the MPI
library can do for you.

I would check the installation of BLCR on all of the compute nodes (blade01
and blade02). Make sure the modules are loaded and that the BLCR single
process examples work on all nodes. I suspect that one of the nodes is
having trouble initializing the BLCR library.
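A per-node smoke test along those lines might look like the following sketch; it assumes passwordless ssh to the blade host names from this thread, that the BLCR tools are on the remote PATH, and that cr_checkpoint writes its default context.<pid> file in the working directory:

```shell
#!/bin/sh
# Test BLCR on every node independently of Open MPI:
# run a program under cr_run, checkpoint it, and restart it.
for host in blade01 blade02; do
    ssh "$host" '
        lsmod | grep -q blcr || { echo "$(hostname): blcr module missing"; exit 1; }
        cr_run sleep 10 &          # a trivial checkpointable process
        pid=$!
        sleep 1
        cr_checkpoint --term "$pid" &&   # checkpoint, then terminate it
        cr_restart "context.$pid" &&     # restart from the checkpoint file
        echo "$(hostname): BLCR C/R OK"
    '
done
```

Any node that fails here will also fail cr_init() inside MPI_Init, which matches the error on blade02 above.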

You may also want to check to make sure prelinking is turned off on all
nodes as well:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
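A quick way to check for prelinking, sketched here for Red Hat-style systems (the /etc/sysconfig/prelink path is an assumption; see the FAQ link above for your distribution's details):

```shell
#!/bin/sh
# Detect whether prelink is present; prelinked libraries can break BLCR
# restarts because their load addresses are no longer reproducible.
if command -v prelink >/dev/null 2>&1; then
    echo "prelink is installed; consider disabling it:"
    echo "  set PRELINKING=no in /etc/sysconfig/prelink (Red Hat-style systems)"
    echo "  then undo existing prelinking with: prelink -ua"
else
    echo "prelink not installed; nothing to do"
fi
```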

If that doesn't work, then I would suggest trying the current Open MPI trunk.
There should not be any problem with using NFS: since this error occurs in
MPI_Init, it is well before we ever try to use the file system. I also test
with NFS and local staging on a fairly regular basis, so NFS shouldn't be a
problem even when checkpointing/restarting.

-- Josh

>
> Regards
>
> whchen

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey