Open MPI User's Mailing List Archives

Subject: [OMPI users] running MPI application and using C/R OpenMPI 1.5.3
From: Marcin Zielinski (Marcin.Zielinski_at_[hidden])
Date: 2011-05-30 05:25:30


Dear all,

After searching various topics for a solution to my problem, I'm forced
to turn to you all here. I have to say, though, that I have not found
this problem reported yet. It could be something amazingly easy to solve.

Anyway, I'm running an MPI application compiled with OpenMPI 1.5.3.
OpenMPI 1.5.3 was compiled with BLCR support. BLCR was compiled with
no errors and works fine. The configure script looks like this:
export CC='icc'
export CXX='icpc'
export F77='ifort'
export FC='ifort'
export F90='ifort'
export FCFLAGS='-O2'
export FFLAGS='-O2'
export CFLAGS='-O2'
export CXXFLAGS='-O2'
export OMP_NUM_THREADS='1'
# export CPP='cpp'

export LD_RUNPATH=$installdir/lib

make clean
./configure --prefix=$installdir \
   --enable-orterun-prefix-by-default \
   --with-openib=$ofed \
   --enable-mpi-threads \
   --enable-ft-thread \
   --with-ft=cr \
   --with-blcr=/path_to_blcr_0.8.2_build_dir/ \
   --with-blcr-libdir=/path_to_blcr_lib_dir/ \
   --disable-dlopen \
   && \
make && make install || exit

The ifort and icc versions are:
$ ifort --version
ifort (IFORT) 11.0 20080930 / 11.0.069 64bit

$ icc --version
icc (ICC) 11.0 20080930 / 11.0.074 64bit

The MPI application (let's skip the name and what it does) runs
perfectly fine when invoking:
mpirun ./myapp < <inputfile for myapp> (running the parallel code serially)

and when invoking:
mpirun -n <nr of cores> ./myapp < <inputfile for myapp>

In both cases it always produces the correct results.

Now, enabling C/R works in one case only:
mpirun -am ft-enable-cr ./myapp < <inputfile for myapp> (running the
parallel code serially with C/R enabled)

Later on, invoking
ompi-checkpoint -s --term <MPIRUN PID>
produces a nice global snapshot, and
ompi-restart <GLOBAL SNAPSHOT NAME>
re-runs the calculations from the checkpointed point perfectly fine,
running to the end with correct results.
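
For reference, the checkpoint and restart steps above can be sketched as a small dry-run wrapper. This is only an illustration: the function names (run, cr_checkpoint, cr_restart) and the DRY_RUN variable are made up for this sketch, and the only command lines used are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Open MPI.
# DRY_RUN=1 (the default here) only prints each command;
# set DRY_RUN=0 to actually execute it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: PID of the running mpirun process
cr_checkpoint() {
    run ompi-checkpoint -s --term "$1"
}

# $1: name of the global snapshot written by the checkpoint
cr_restart() {
    run ompi-restart "$1"
}
```

In dry-run mode, calling cr_checkpoint with the mpirun PID and then cr_restart with the snapshot name just prints the two command lines in the order they would be issued.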

Now, invoking
mpirun -n <nr of cores > 1> -am ft-enable-cr ./myapp < <inputfile for myapp>

and checkpointing:
ompi-checkpoint -s --term <MPIRUN PID>
produces the following error for myapp (in the case of -n 2):

forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
[hostname:29664] local) Error: Unable to read state from named pipe
(/global_dir/opal_cr_prog_write.29666). 0
[hostname:29664] [[27518,0],0] ORTE_ERROR_LOG: Error in file
snapc_full_local.c at line 1602
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 29666 on
node hostname exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

The /global_dir/, /local_dir/ and /tmp_dir/ are all readable and writable
by the user who invokes mpirun myapp.
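
A check of this kind can be sketched as a short shell function. The name check_rw is hypothetical, and the three directory names are the placeholders used above, not real paths.

```shell
#!/bin/sh
# Hypothetical helper: reports whether each given directory exists
# and is readable and writable by the current user.
check_rw() {
    status=0
    for d in "$@"; do
        if [ -d "$d" ] && [ -r "$d" ] && [ -w "$d" ]; then
            echo "ok: $d"
        else
            echo "not readable/writable: $d"
            status=1
        fi
    done
    # Non-zero if any directory failed the check
    return $status
}
```

It would be run as the same user that starts mpirun, for example: check_rw /global_dir/ /local_dir/ /tmp_dir/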

Any suggestions off the top of your heads?
I would appreciate any help on this.

Best regards,

-- 
Marcin Zielinski