Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] change hosts to restart the checkpoint
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-03-05 10:03:41


This type of failure is usually due to prelink'ing being left enabled
on one or more of the systems. This has come up multiple times on the
Open MPI list, but is actually a problem between BLCR and the Linux
kernel. BLCR has a FAQ entry on this that you will want to check out:
   https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
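
As a quick first check before working through the FAQ, something like the following could be run on each node. This is only a sketch: it assumes a Red Hat-style layout where the prelink settings live in /etc/sysconfig/prelink and contain a PRELINKING=yes or PRELINKING=no line, and the function name prelink_status is made up for illustration. Other distributions may keep this setting elsewhere.

```shell
# Report whether prelinking looks enabled, based on a prelink config file.
# Assumption: the file uses a "PRELINKING=yes|no" line (Red Hat-style
# /etc/sysconfig/prelink). prelink_status is a hypothetical helper name.
prelink_status() {
    conf="${1:-/etc/sysconfig/prelink}"

    # No readable config: we cannot tell, so say so rather than guess.
    if [ ! -r "$conf" ]; then
        echo "unknown"
        return 0
    fi

    # Use the last uncommented PRELINKING= assignment; drop quotes and spaces.
    value=$(sed -n 's/^[[:space:]]*PRELINKING=//p' "$conf" \
            | tail -n 1 | tr -d "\"' \t")

    case "$value" in
        yes) echo "enabled" ;;
        no)  echo "disabled" ;;
        *)   echo "unknown" ;;
    esac
}
```

Calling prelink_status with no argument reads the assumed default path. If any of host1 through host4 reports "enabled", that node is a likely culprit; the FAQ entry above describes how to turn prelinking off and undo it for libraries that were already prelinked.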

If that does not work, then we can look into other causes.

-- Josh

On Mar 5, 2010, at 3:06 AM, 马少杰 wrote:

> 2010-03-05
> 马少杰
> Dear Sir:
> I want to use Open MPI and BLCR for checkpointing. However, I want to
> restart from the checkpoint on other hosts. For example, I run an MPI
> program using Open MPI on host1 and host2, and I save the checkpoint
> file to an NFS shared path. Then I want to restart the job
> (ompi-restart -machinefile ma ompi_global_snapshot_15865.ckpt) on
> host3 and host4. The four hosts have the same hardware and software.
> If I change the hostnames in the machinefile to host3 and host4, the
> following error always occurs:
> [node182:27278] *** Process received signal ***
> [node182:27278] Signal: Segmentation fault (11)
> [node182:27278] Signal code: Address not mapped (1)
> [node182:27278] Failing at address: 0x3b81009530
> [node182:27275] *** Process received signal ***
> [node182:27275] Signal: Segmentation fault (11)
> [node182:27275] Signal code: Address not mapped (1)
> [node182:27275] Failing at address: 0x3b81009530
> [node182:27274] *** Process received signal ***
> [node182:27274] Signal: Segmentation fault (11)
> [node182:27274] Signal code: Address not mapped (1)
> [node182:27274] Failing at address: 0x3b81009530
> [node182:27276] *** Process received signal ***
> [node182:27276] Signal: Segmentation fault (11)
> [node182:27276] Signal code: Address not mapped (1)
> [node182:27276] Failing at address: 0x3b81009530
> --------------------------------------------------------------------------
> mpirun noticed that process rank 9 with PID 27973 on node node183
> exited on signal 11 (Segmentation fault).
>
> If I change the hostnames back to host1 and host2, the restart
> succeeds.
>
> My Open MPI version is 1.3.4, configured with:
> ./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread \
>     --with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi \
>     --enable-mpirun-prefix-by-default
>
> The command used to run the MPI program is:
> mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 \
>     -machinefile ma ./cpi
>
> vim $HOME/.openmpi/mca-params.conf
> crs_base_snapshot_dir=/tmp/cr
> snapc_base_global_snapshot_dir=/disk/cr
>