Hi all,
I went through some earlier ompi-restart issues, but I couldn't
find anything similar to this problem.
I have installed BLCR and configured Open MPI 'openmpi-1.3a1r19645'.
i) If the sample MPI program is run on a single machine (np 4,
without any hostfile) and I checkpoint it, the checkpoint succeeds
and ompi-restart works as well.
ii) If the sample MPI program is run across 2 different nodes, the
checkpoint succeeds BUT ompi-restart throws the following error:
[audhakne@acl-cadi-pentd-1 ~]$ ompi-restart ompi_global_snapshot_7604.ckpt
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 9590 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--------------------------------------------------------------------------
Please let me know if more information is needed.
--
Thanks and Regards,
Arun U. Dhakne