
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault when checkpointing
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2012-03-29 14:21:45


This is a bit of a non-answer, but can you try the 1.5 series (1.5.5 is
the current release)? 1.4 is being phased out, and 1.5 will replace it in
the near future. 1.5 has a number of C/R-related fixes that might help.
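
If you do rebuild, a rough sketch of a C/R-enabled build of 1.5.5 might
look like the following. The BLCR install prefix is just a placeholder,
and the exact threading options differ between the 1.4 and 1.5 series, so
please check ./configure --help and the C/R FAQ for your version:

  # assumes BLCR 0.8.4 is installed under /usr/local/blcr (placeholder path)
  ./configure --with-ft=cr --enable-ft-thread \
      --with-blcr=/usr/local/blcr --with-blcr-libdir=/usr/local/blcr/lib \
      --prefix=/opt/openmpi-1.5.5
  make all install

  # then run and checkpoint roughly as you do now:
  mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
  ompi-checkpoint -v <PID of mpirun>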

-- Josh

On Thu, Mar 29, 2012 at 1:12 PM, Linton, Tom <tom.linton_at_[hidden]> wrote:
> We have a legacy application that runs fine on our cluster using Intel MPI
> with hundreds of cores. We ported it to Open MPI so that we could use BLCR,
> and it runs fine, but checkpointing is not working properly:
>
> 1. When we checkpoint with more than one core, each MPI rank reports a
> segmentation fault and the ompi-checkpoint command does not return. For
> example, with two cores we get:
>
> [tscco28017:16352] *** Process received signal ***
> [tscco28017:16352] Signal: Segmentation fault (11)
> [tscco28017:16352] Signal code: Address not mapped (1)
> [tscco28017:16352] Failing at address: 0x7fffef51
> [tscco28017:16353] *** Process received signal ***
> [tscco28017:16353] Signal: Segmentation fault (11)
> [tscco28017:16353] Signal code: Address not mapped (1)
> [tscco28017:16353] Failing at address: 0x7fffef51
> [tscco28017:16353] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16353] [ 1] [0xf500b0]
> [tscco28017:16353] *** End of error message ***
> [tscco28017:16352] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16352] [ 1] [0xf500b0]
> [tscco28017:16352] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 16353 on node tscco28017 exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> When I execute the TotalView debugger on a resulting core file (I assume
> it’s for the rank 0 process), TotalView reports a null frame pointer and
> the stack is trashed (gdb shows a backtrace with 30 frames but no debug
> info).
>
> 2. Checkpointing with 1 core on the legacy program works.
>
> 3. Checkpointing with a simple test program on 16 cores works.
>
> Can you suggest how to debug this problem?
>
> Some additional information:
>
> - I execute the program like this: mpirun -am ft-enable-cr -n 2
>   -machinefile machines program inputfile
> - We are using Open MPI 1.4.4 with BLCR 0.8.4.
> - Open MPI and the application were both compiled on the same machine
>   using the Intel icc 12.0.4 compiler.
> - For the failing example, both MPI processes are running on cores on
>   the same node.
> - I have attached “ompi_info.txt”.
> - We’re running on a single Xeon 5150 node with Gigabit Ethernet.
> - [Reuti: previously I reported a problem involving illegal instructions,
>   but this turned out to be a build problem. Sorry I didn’t answer your
>   response to my previous thread; I was having problems accessing this
>   email list at that time.]
>
> Thanks
>
> Tom
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey