Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Openmpi Checkpoint/Restart failed
From: Xianjun Meng (xjun.meng_at_[hidden])
Date: 2010-12-23 21:03:55


Dear all,

I have figured it out. It was a simple issue: I had not added the "blcr lib"
to the $PATH environment variable. Without it, the checkpoint operation still
worked, but the restart operation failed. That was so weird.
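
For anyone who hits the same symptom, a quick check along the following lines
may help. It is only a sketch: it assumes the BLCR command-line tools are
installed under their usual names, cr_checkpoint and cr_restart, and that the
restart step needs to find them through $PATH, which would be consistent with
the behavior described above. The helper name check_on_path is made up for
illustration.

```shell
#!/bin/sh
# Report which of the given commands cannot be found through $PATH.
# Prints one "missing: NAME" line per command that is not found and
# returns non-zero if at least one is missing.
check_on_path() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}

# cr_checkpoint and cr_restart are assumed to be the BLCR tool names.
check_on_path cr_checkpoint cr_restart || echo "BLCR tools are not on PATH"
```

Since the failure above was reported for the rank on the second machine, it is
probably worth running the check on every host listed in the hostfile, not
only on the one where ompi-restart is started.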

Best regards
Xianjun Meng

On 2010-12-23 at 5:35 PM, Xianjun Meng <xjun.meng_at_[hidden]> wrote:

> My main question is:
>
> After I finished the checkpoint operation on a simple task that ran
> on two machines, I could only restart it on one machine. If I run the
> following command to force ompi-restart to run the program on two
> machines:
>
> *ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt*
> (the machine_names file contains two host names)
>
> the output is:
> *
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> [jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
> [jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
> [0x302af68b85]
> [jx-mpi-fcr048:04116] [ 2]
> /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41)
> [0x2a9557de31]
> [jx-mpi-fcr048:04116] [ 3]
> /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27)
> [0x2a95573ac7]
> [jx-mpi-fcr048:04116] [ 4]
> /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f)
> [0x2a95568a0f]
> [jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
> [jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x302af1c4bb]
> [jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
> [jx-mpi-fcr048:04116] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 4116 on node
> jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> *
>
> My global_snapshot_meta.data is:
>
> *# Seq: 0
> # Timestamp: Thu Dec 23 16:39:46 2010
> # Process: 1680080897.0
> # OPAL CRS Component: blcr
> # Snapshot Reference: opal_snapshot_0.ckpt
> # Snapshot Location:
> /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
> # Process: 1680080897.1
> # OPAL CRS Component: blcr
> # Snapshot Reference: opal_snapshot_1.ckpt
> # Snapshot Location:
> /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
> # Timestamp: Thu Dec 23 16:39:47 2010
> # Finished Seq: 0*
>
> Does anybody know why?
>
> Thanks
> Xianjun Meng
>
>
> 2010/12/23 Xianjun Meng <xjun.meng_at_[hidden]>
>
>> Dear all,
>>
>> I have been trying the checkpoint/restart function of Open MPI recently,
>> and after several failures and checking a lot of the documentation, I am
>> still very confused about how to configure it. Can anybody give me a
>> $HOME/.openmpi/mca-params.conf file and tell me what parameters I should
>> specify when I install Open MPI?
>>
>> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
>>
>>
>> Thanks
>> Xianjun Meng
>>
>
>
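
The global_snapshot_meta.data listing quoted above names one snapshot
reference per process (opal_snapshot_0.ckpt and opal_snapshot_1.ckpt). When a
restart on two machines fails, one more thing that may be worth checking on
each host is whether those references are actually visible there. The sketch
below does that under two assumptions: the metadata file has the
"# Snapshot Reference:" line format shown in the listing, and each reference
is stored directly under the directory given as "Snapshot Location". The
function names are made up for illustration.

```shell
#!/bin/sh
# Print the snapshot references recorded in a metadata file, one per
# line, by stripping the "# Snapshot Reference:" prefix.
list_snapshot_refs() {
    sed -n 's/^# Snapshot Reference: *//p' "$1"
}

# For each reference in the metadata file, report whether an entry of
# that name exists under the given directory on this host.
# Assumption: references are stored directly under seq_dir.
check_snapshots() {
    meta_file=$1
    seq_dir=$2
    list_snapshot_refs "$meta_file" | while read -r ref; do
        if [ -e "$seq_dir/$ref" ]; then
            echo "found: $ref"
        else
            echo "not found: $ref"
        fi
    done
}
```

With the values from the listing, the second argument would be
/home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0 and the first the
path to global_snapshot_meta.data. A "not found" line on one of the hosts
would suggest that the checkpoint directory is not shared between the
machines.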