Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Openmpi Checkpoint/Restart failed
From: Xianjun Meng (xjun.meng_at_[hidden])
Date: 2010-12-23 04:35:01


My main question is:

After I finished the checkpoint operation on a simple task that ran on
two machines, I can only restart it on one machine. If I run the following
command to force ompi-restart to run the program on two machines:

*ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt*
(the machine_names file contains two host names)

the output is:
*--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--------------------------------------------------------------------------
[jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
[jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x302af68b85]
[jx-mpi-fcr048:04116] [ 2] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41) [0x2a9557de31]
[jx-mpi-fcr048:04116] [ 3] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27) [0x2a95573ac7]
[jx-mpi-fcr048:04116] [ 4] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f) [0x2a95568a0f]
[jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
[jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x302af1c4bb]
[jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
[jx-mpi-fcr048:04116] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 4116 on node
jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------*

My global_snapshot_meta.data is:

*# Seq: 0
# Timestamp: Thu Dec 23 16:39:46 2010
# Process: 1680080897.0
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_0.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Process: 1680080897.1
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_1.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Timestamp: Thu Dec 23 16:39:47 2010
# Finished Seq: 0*
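
For anyone inspecting this file, here is a minimal Python sketch that lists which snapshot each process maps to. The function names are hypothetical, the field names are copied from the file above, and joining "Snapshot Location" with "Snapshot Reference" to form a directory path is an assumption about the on-disk layout, not something the file itself states.

```python
import posixpath

# Sample input copied from the metadata shown above (without the
# surrounding "*" emphasis markers added by the mail client).
SAMPLE = """\
# Seq: 0
# Timestamp: Thu Dec 23 16:39:46 2010
# Process: 1680080897.0
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_0.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Process: 1680080897.1
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_1.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Timestamp: Thu Dec 23 16:39:47 2010
# Finished Seq: 0
"""

# Per-process fields that follow a "# Process:" line in the file above.
PROCESS_FIELDS = ("OPAL CRS Component", "Snapshot Reference", "Snapshot Location")


def parse_snapshot_meta(text):
    """Return one dict per "# Process:" entry, in file order."""
    processes = []
    current = None
    for raw in text.splitlines():
        # Drop stray "*" markers, the leading "#", and whitespace.
        line = raw.strip().strip("*").lstrip("#").strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so timestamps and paths stay intact.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Process":
            current = {"Process": value}
            processes.append(current)
        elif key in PROCESS_FIELDS and current is not None:
            current[key] = value
    return processes


def snapshot_paths(processes):
    """Pair each process name with its assumed snapshot directory.

    Assumption: the local snapshot lives at <Snapshot Location>/<Snapshot Reference>.
    """
    return [
        (p["Process"], posixpath.join(p["Snapshot Location"], p["Snapshot Reference"]))
        for p in processes
    ]


if __name__ == "__main__":
    for name, path in snapshot_paths(parse_snapshot_meta(SAMPLE)):
        print(name, path)
```

Checking on each host listed in machine_names whether these paths exist would show if both snapshots are visible there. Whether that is related to the error above is not established.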

Does anybody know why?

Thanks
Xianjun Meng

2010/12/23 Xianjun Meng <xjun.meng_at_[hidden]>

> Dear all,
>
> I had to try the checkpoint/restart function of Open MPI recently, and after
> several failures and checking a lot of the documentation, I am still very confused
> about how to configure the checkpoint/restart function. Can anybody give me a
> $HOME/.openmpi/mca-params.conf file and tell me which parameters I
> should specify when I install Open MPI?
>
> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
>
>
> Thanks
> Xianjun Meng
>