
Subject: Re: [OMPI users] running MPI application and using C/R OpenMPI 1.5.3
From: Marcin Zielinski (Marcin.Zielinski_at_[hidden])
Date: 2011-06-06 02:57:41


Hello,

Has anyone had a chance to look into this riddle of mine?

> $> mpirun -n 2 -am ft-enable-cr ./myapp < <inputfile for
> myapp>
>
> $> ompi-checkpoint -s --term <MPIRUN PID>
> produces the following error for myapp (in case of -n 2):
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> [hostname:29664] local) Error: Unable to read state from named pipe
> (/global_dir/opal_cr_prog_write.29666). 0
> [hostname:29664] [[27518,0],0] ORTE_ERROR_LOG: Error in file
> snapc_full_local.c at line 1602
>
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 29666 on
> node hostname exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> --------------------------------------------------------------------------
>
> The /global_dir/, /local_dir/ and /tmp_dir/ are all readable/writable by
> the user who invokes mpirun myapp.

Best regards,

On 05/30/2011 11:25 AM, Marcin Zielinski wrote:
> Dear all,
>
> After searching various topics for a solution to my problem, I'm forced
> to turn to you all here. I have to say, though, that I have not found this
> particular problem reported anywhere yet. It could be something amazingly
> easy to solve.
>
> Anyway, I'm running an MPI application compiled with Open MPI 1.5.3.
> Open MPI 1.5.3 has been compiled with BLCR support. BLCR compiled with
> no errors and works fine. The configure looks like this:
> export CC='icc'
> export CXX='icpc'
> export F77='ifort'
> export FC='ifort'
> export F90='ifort'
> export FCFLAGS='-O2'
> export FFLAGS='-O2'
> export CFLAGS='-O2'
> export CXXFLAGS='-O2'
> export OMP_NUM_THREADS='1'
> # export CPP='cpp'
>
> export LD_RUNPATH=$installdir/lib
>
> make clean
> ./configure --prefix=$installdir \
> --enable-orterun-prefix-by-default \
> --with-openib=$ofed \
> --enable-mpi-threads \
> --enable-ft-thread \
> --with-ft=cr \
> --with-blcr=/path_to_blcr_0.8.2_build_dir/ \
> --with-blcr-libdir=/path_to_blcr_lib_dir/ \
> --disable-dlopen \
> && \
> make && make install || exit
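
A quick sanity check that the resulting build really ended up with C/R support
(a rough sketch; the exact wording of the ompi_info output may differ between
versions):

# Verify that checkpoint/restart support and the BLCR CRS component were built in.
$installdir/bin/ompi_info | grep -i checkpoint   # expect FT checkpoint support: yes
$installdir/bin/ompi_info | grep -i crs          # expect an "MCA crs: blcr" line
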
>
> ifort and icc are:
> $ ifort --version
> ifort (IFORT) 11.0 20080930 / 11.0.069 64bit
>
> $ icc --version
> icc (ICC) 11.0 20080930 / 11.0.074 64bit
>
> The MPI application (let's skip its name and what it does) runs
> perfectly fine when invoked as:
> mpirun ./myapp < <inputfile for myapp> (running the parallel code serially)
>
> and when invoking:
> mpirun -n <nr of cores> ./myapp < <inputfile for myapp>
>
> In both cases it always produces the correct results from the calculations.
>
> Now, enabling C/R works for one case only:
> mpirun -am ft-enable-cr ./myapp < <inputfile for myapp> (running the
> parallel code serially with C/R enabled)
>
> Later on, invoking ompi-checkpoint -s --term <MPIRUN PID>
> produces a nice global snapshot, and
> ompi-restart <GLOBAL SNAPSHOT NAME>
> resumes the calculations from the checkpointed point perfectly fine,
> finishing to the end with proper results.
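
Spelled out, the full serial cycle that works is roughly the following (a
sketch; the input file and snapshot name are placeholders, and ompi-checkpoint
prints the actual global snapshot reference to use):

# Working serial C/R cycle (sketch; names in <> are placeholders).
mpirun -am ft-enable-cr ./myapp < <inputfile for myapp> &
MPIRUN_PID=$!

# -s shows the checkpoint progress; --term shuts the job down once the snapshot is taken.
ompi-checkpoint -s --term $MPIRUN_PID
# ompi-checkpoint reports a global snapshot reference, e.g. ompi_global_snapshot_<PID>.ckpt

# Resume from the snapshot it reported; the run then finishes with correct results.
ompi-restart ompi_global_snapshot_<PID>.ckpt
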
>
> Now, invoking
> mpirun -n <nr of cores, more than 1> -am ft-enable-cr ./myapp < <inputfile
> for myapp>
>
> and checkpointing:
> ompi-checkpoint -s --term <MPIRUN PID>
> produces the following error for myapp (in case of -n 2):
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> [hostname:29664] local) Error: Unable to read state from named pipe
> (/global_dir/opal_cr_prog_write.29666). 0
> [hostname:29664] [[27518,0],0] ORTE_ERROR_LOG: Error in file
> snapc_full_local.c at line 1602
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 29666 on
> node hostname exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> The /global_dir/, /local_dir/ and /tmp_dir/ are all readable/writable by
> the user who invokes mpirun myapp.
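
For what it's worth, these are the directories one would normally point the
C/R framework at via MCA parameters, roughly like this (a sketch; the
parameter names are the usual snapc/crs ones as far as I know, and the paths
are just the placeholders used above):

# Sketch: wire the snapshot directories to the shared and node-local paths,
# e.g. in $HOME/.openmpi/mca-params.conf or via --mca on the mpirun command line.
snapc_base_global_snapshot_dir = /global_dir   # shared directory, visible to all nodes
crs_base_snapshot_dir          = /local_dir    # node-local staging area for BLCR
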
>
> Any suggestions off the top of your heads?
> I would appreciate any help on this.
>
> Best regards,