Dear Sir,

I want to use Open MPI with BLCR for checkpointing. However, I want to
restart from the checkpoint on different hosts. For example, I run an MPI
program with Open MPI on host1 and host2 and save the checkpoint files to
an NFS shared path. Then I want to restart the job
(ompi-restart -machinefile ma ompi_global_snapshot_15865.ckpt) on host3 and
host4. All four hosts have the same hardware and software.

If I change the hostnames in the machinefile to host3 and host4, the
following error always occurs:

[node182:27278] *** Process received signal ***
[node182:27278] Signal: Segmentation fault (11)
[node182:27278] Signal code: Address not mapped (1)
[node182:27278] Failing at address: 0x3b81009530
[node182:27275] *** Process received signal ***
[node182:27275] Signal: Segmentation fault (11)
[node182:27275] Signal code: Address not mapped (1)
[node182:27275] Failing at address: 0x3b81009530
[node182:27274] *** Process received signal ***
[node182:27274] Signal: Segmentation fault (11)
[node182:27274] Signal code: Address not mapped (1)
[node182:27274] Failing at address: 0x3b81009530
[node182:27276] *** Process received signal ***
[node182:27276] Signal: Segmentation fault (11)
[node182:27276] Signal code: Address not mapped (1)
[node182:27276] Failing at address: 0x3b81009530
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27973 on node node183 exited on signal 11 (Segmentation fault).

If I change the hostnames back to host1 and host2, the restart succeeds.

My Open MPI version is 1.3.4, configured with:

./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread --with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi --enable-mpirun-prefix-by-default

The command used to run the MPI program is:

mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile ma ./cpi

The contents of $HOME/.openmpi/mca-params.conf are:

crs_base_snapshot_dir=/tmp/cr
snapc_base_global_snapshot_dir=/disk/cr