Subject: Re: [OMPI users] Openmpi Checkpoint/Restart failed
From: Xianjun Meng (xjun.meng_at_[hidden])
Date: 2010-12-23 04:35:01

My main question is:

After checkpointing a simple job that ran on two machines, I can only
restart it on one machine. If I run the following command to make
ompi-restart run the program on two machines:

*ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt*
(the machine_names file contains two host names)

the output is:
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_1.ckpt). Returned -1.

[jx-mpi-fcr048:04116] [ 0] /lib64/tls/ [0x302b80c420]
[jx-mpi-fcr048:04116] [ 1] /lib64/tls/
[jx-mpi-fcr048:04116] [ 2]
[jx-mpi-fcr048:04116] [ 3]
[jx-mpi-fcr048:04116] [ 4]
[jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
[jx-mpi-fcr048:04116] [ 6] /lib64/tls/
[jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
[jx-mpi-fcr048:04116] *** End of error message ***
mpirun noticed that process rank 1 with PID 4116 on node exited on signal 11 (Segmentation fault).

My global snapshot metadata is:

*# Seq: 0
# Timestamp: Thu Dec 23 16:39:46 2010
# Process: 1680080897.0
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_0.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Process: 1680080897.1
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_1.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Timestamp: Thu Dec 23 16:39:47 2010
# Finished Seq: 0*
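
For anyone who wants to check the same listing on their side, here is a minimal sketch that reads it into one record per process. It assumes the file keeps the `# Key: value` layout shown above; `parse_snapshot_metadata` is a hypothetical helper written for this message, not part of Open MPI or BLCR.

```python
def parse_snapshot_metadata(text):
    """Return one dict per '# Process:' entry in the metadata listing.

    Hypothetical helper: it assumes every line looks like '# Key: value',
    as in the listing above, and is not an Open MPI API.
    """
    wanted = ("OPAL CRS Component", "Snapshot Reference", "Snapshot Location")
    processes = []
    current = None
    for raw in text.splitlines():
        # Drop the mail client's '*' emphasis markers and the leading '#'.
        line = raw.strip().strip("*").lstrip("#").strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Process":
            # A new process entry starts here.
            current = {"Process": value}
            processes.append(current)
        elif current is not None and key in wanted:
            current[key] = value
    return processes


SAMPLE = """\
# Seq: 0
# Timestamp: Thu Dec 23 16:39:46 2010
# Process: 1680080897.0
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_0.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Process: 1680080897.1
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_1.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Timestamp: Thu Dec 23 16:39:47 2010
# Finished Seq: 0
"""

if __name__ == "__main__":
    for proc in parse_snapshot_metadata(SAMPLE):
        print(proc["Process"], proc["Snapshot Reference"], proc["Snapshot Location"])
```

This only shows which snapshot reference and location each rank recorded, so they can be compared with what is actually visible on each host; it does not explain the segmentation fault by itself.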

Does anybody know why?

Xianjun Meng

2010/12/23 Xianjun Meng <xjun.meng_at_[hidden]>

> Dear all,
> I have been trying the checkpoint/restart function of Open MPI recently, and
> after several failures and checking a lot of the documentation, I am still very
> confused about how to configure it. Can anybody give me a
> $HOME/.openmpi/mca-params.conf file and tell me which parameters I
> should specify when I install Open MPI?
> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
> Thanks
> Xianjun Meng