Open MPI User's Mailing List Archives

Subject: [OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??
From: arun dhakne (arundhakne_at_[hidden])
Date: 2008-10-06 15:55:27


Hi all,

This is the procedure I followed to install Open MPI. Is there an
installation or environment-setting problem in it?
An Open MPI program with 4 processes is run across 2 dual-core Intel
machines, with 2 processes running on each machine.

ompi-checkpoint succeeds, but ompi-restart fails with the following error:

$:> ompi-restart ompi_global_snapshot_6045.ckpt
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 6372 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--------------------------------------------------------------------------

Open MPI installation steps:
./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr \
    --with-blcr=/usr/lib64 --enable-debug
make
make install

export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
export PATH=$HOME/.openmpi/bin:$PATH
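
The exports above only apply to the shell they are typed in, so it may be worth confirming that a non-interactive shell on each node sees the same values. A minimal sketch of such a check follows; path_contains is a helper name made up for this example, not part of Open MPI, and the host name in the usage comment is a placeholder.

```shell
# path_contains LIST DIR
# Succeeds if DIR is one of the colon-separated entries in LIST.
# A single trailing slash on DIR or on the entry is ignored.
path_contains() {
    dir=${2%/}
    case ":$1:" in
        *":$dir:"* | *":$dir/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage (node2 is a placeholder host name):
#   remote_path=$(ssh node2 'echo "$PATH"')
#   path_contains "$remote_path" "$HOME/.openmpi/bin" || echo "bin dir missing on node2"
```

The same helper can be pointed at the remote LD_LIBRARY_PATH to check for $HOME/.openmpi/lib.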

NOTE: BLCR is loaded as a kernel module:
$:> lsmod | grep blcr

blcr 117892 0
blcr_vmadump 58264 1 blcr
blcr_imports 46080 2 blcr,blcr_vmadump
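
Since the restart spans two nodes, the same three modules need to be loaded on each of them. A rough sketch for checking that from lsmod output follows; missing_blcr_modules is a name invented for this example, the module names are the ones listed above, and the host names in the usage comment are placeholders.

```shell
# Reads `lsmod` output on stdin and prints the name of every expected
# BLCR module that does not appear in it. Prints nothing if all are loaded.
missing_blcr_modules() {
    awk '
        { loaded[$1] = 1 }   # first column of lsmod output is the module name
        END {
            n = split("blcr blcr_imports blcr_vmadump", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in loaded)) print want[i]
        }'
}

# Possible usage (node1 and node2 are placeholder host names):
#   for host in node1 node2; do
#       echo "== $host"
#       ssh "$host" /sbin/lsmod | missing_blcr_modules
#   done
```

An empty result for a host means all three modules showed up in its lsmod listing.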

Please let me know if there is a problem with the above procedure.
Thanks a lot for your time.

Best,

---------- Forwarded message ----------
From: arun dhakne <arundhakne_at_[hidden]>
Date: Tue, Sep 30, 2008 at 12:52 AM
Subject: ompi-restart issue : ompi-restart doesn't work across nodes
To: Open MPI Users <users_at_[hidden]>

Hi all,

I went through some previous ompi-restart issues, but I couldn't
find anything similar to this problem.

I have installed BLCR and configured Open MPI 'openmpi-1.3a1r19645'.

i) If the sample MPI program is run on a single machine (np 4, without
any hostfile), checkpointing succeeds and ompi-restart works as well.

ii) If the sample MPI program is run across 2 different nodes,
checkpointing succeeds, BUT ompi-restart throws the following
error:

$ ompi-restart ompi_global_snapshot_7604.ckpt
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 9590 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--------------------------------------------------------------------------

Please let me know if more information is needed.

--
Thanks and Regards,
Arun U. Dhakne