Subject: [OMPI users] Question about restart
From: Yaakoub El Khamra (yye00_at_[hidden])
Date: 2009-04-22 15:13:41

I am trying to get checkpoint/restart working on a single machine
with Open MPI 1.3 (I also tried an svn checkout) and ran into a few
problems. I am guessing I am doing something wrong, and would
appreciate some help.

I built openmpi with:
 ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
--enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
--enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
--enable-trace --enable-static=yes --enable-debug
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
--enable-ft-thread --with-blcr=/usr/local/blcr/
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

I am using blcr 0.8.1 configured with:
 ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
--enable-libcr-tracing=yes --enable-kernel-tracing=yes
--enable-testsuite=yes --enable-all-static=yes --enable-static=yes

Checkpointing works fine. I run the job with:
 mpirun -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1 -am
ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1 matmultf.exe

I then checkpoint it without any problems using:
 ompi-checkpoint --status --term <pid>
When I try to restart, however, I get the following error:

[yye00@localhost FTOpenMPI]$ ompi-restart -v ompi_global_snapshot_23858.ckpt
[localhost.localdomain:24394] Checking for the existence of
[localhost.localdomain:24394] Restarting from file
[localhost.localdomain:24394] Exec in self
malloc debug: Invalid free (signal.c, 304)
malloc debug: Invalid free (signal.c, 304)
[localhost:23860] *** Process received signal ***
[localhost:23860] Signal: Bus error (7)
[localhost:23860] Signal code: (2)
[localhost:23860] Failing at address: 0x7fcbb737ef88
[localhost:23860] [ 0] /lib64/ [0x32d720f0f0]
[localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [ 4]
[localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [ 7]
[localhost:23860] [ 8]
[localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [10]
[localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/
[localhost:23860] [12] /lib64/ [0x32d72073da]
[localhost:23860] [13] /lib64/ [0x32d66e62bd]
[localhost:23860] *** End of error message ***
mpirun noticed that process rank 1 with PID 24396 on node
localhost.localdomain exited on signal 7 (Bus error).
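
For reference, the checkpoint and restart steps above can be written
as a small dry-run script. This is only a sketch: the helper names
(snapshot_name, checkpoint_cmd, restart_cmd) are hypothetical, and it
assumes the number in the snapshot name is the mpirun PID that was
passed to ompi-checkpoint, which is consistent with the log above but
not otherwise confirmed here. It prints the commands instead of
running them.

```shell
#!/bin/sh
# Dry-run sketch of the checkpoint/restart sequence from this post.
# Helper names are hypothetical; nothing here calls Open MPI itself.

# Snapshot name for a given mpirun PID, following the pattern seen in
# this post (ompi_global_snapshot_23858.ckpt). ASSUMPTION: the number
# in the name is the PID given to ompi-checkpoint.
snapshot_name() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Command line that checkpoints and terminates the job of mpirun PID $1.
checkpoint_cmd() {
    printf 'ompi-checkpoint --status --term %s\n' "$1"
}

# Command line that restarts from the snapshot of mpirun PID $1.
restart_cmd() {
    printf 'ompi-restart -v %s\n' "$(snapshot_name "$1")"
}

# Print the two commands for review; 23858 is the value from the log.
MPIRUN_PID=${MPIRUN_PID:-23858}
checkpoint_cmd "$MPIRUN_PID"
restart_cmd "$MPIRUN_PID"
```

Setting MPIRUN_PID in the environment before running the script
prints the same two commands for a different job.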

Running strace on ompi-restart did not provide any useful
information. Incidentally, line 304 of signal.c is in opal's
evsignal_dealloc subroutine, which handles deallocation; the line
itself is "free(base->sig.sh_old);". I am about to add debug
statements to that subroutine to see if I can get further
information, but I was hoping the problem is more user-related than
code-related. Any suggestions are greatly appreciated.

Yaakoub El Khamra