Open MPI User's Mailing List Archives

Subject: [OMPI users] change hosts to restart the checkpoint
From: 马少杰 (mashao_jie_at_[hidden])
Date: 2010-03-05 03:06:20

Dear Sir:
   I want to use Open MPI with BLCR for checkpointing. However, I want to restart from the checkpoint
on different hosts. For example, I run an MPI program with Open MPI on
host1 and host2, and I save the checkpoint files to an NFS-shared path.
Then I want to restart the job (ompi-restart -machinefile ma ompi_global_snapshot_15865.ckpt) on host3 and
host4. All four hosts have the same hardware and software. If I change the hostnames in the machinefile to host3 and host4, the following error always occurs:
 [node182:27278] *** Process received signal ***
[node182:27278] Signal: Segmentation fault (11)
[node182:27278] Signal code: Address not mapped (1)
[node182:27278] Failing at address: 0x3b81009530
[node182:27275] *** Process received signal ***
[node182:27275] Signal: Segmentation fault (11)
[node182:27275] Signal code: Address not mapped (1)
[node182:27275] Failing at address: 0x3b81009530
[node182:27274] *** Process received signal ***
[node182:27274] Signal: Segmentation fault (11)
[node182:27274] Signal code: Address not mapped (1)
[node182:27274] Failing at address: 0x3b81009530
[node182:27276] *** Process received signal ***
[node182:27276] Signal: Segmentation fault (11)
[node182:27276] Signal code: Address not mapped (1)
[node182:27276] Failing at address: 0x3b81009530
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27973 on node node183 exited on signal 11 (Segmentation fault).

If I change the hostnames back to host1 and host2, the restart succeeds.

My Open MPI version is 1.3.4, configured with:
 ./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread --with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi --enable-mpirun-prefix-by-default

The command used to run the MPI program is:
mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile ma ./cpi

vim $HOME/.openmpi/mca-params.conf
crs_base_snapshot_dir=/tmp/cr
snapc_base_global_snapshot_dir=/disk/cr
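
To make the sequence above easier to reproduce, here is a minimal dry-run sketch in shell. It writes a machinefile for the restart hosts and only prints the three commands (run, checkpoint, restart) instead of executing them. The function names and the file name ma.restart are hypothetical, and it assumes that the number in the snapshot name (15865) is the PID of mpirun that was passed to ompi-checkpoint.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands, does not run them.
# write_machinefile, print_steps and ma.restart are hypothetical names.

# Write a machinefile with one hostname per line.
write_machinefile() {
    file=$1
    shift
    : > "$file"
    for host in "$@"; do
        printf '%s\n' "$host" >> "$file"
    done
}

# Print the three steps: run, checkpoint, restart.
# $1 = machinefile used for the original run
# $2 = machinefile used for the restart
# $3 = PID of mpirun (assumed to be the number in the snapshot name)
print_steps() {
    printf 'mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile %s ./cpi\n' "$1"
    printf 'ompi-checkpoint %s\n' "$3"
    printf 'ompi-restart -machinefile %s ompi_global_snapshot_%s.ckpt\n' "$2" "$3"
}

# Restart hosts from the example: host3 and host4.
write_machinefile ma.restart host3 host4
print_steps ma ma.restart 15865
```

Keeping a separate machinefile for the restart leaves the original one (ma, listing host1 and host2) untouched, so switching back to the hosts where the restart works is a one-word change on the ompi-restart line.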