Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open MPI Checkpoint Restart
From: Neel Sunil Desai (Neel.Desai_at_[hidden])
Date: 2013-06-04 11:57:45


Hi,

So, I was able to get rid of the "cannot open shared object file" errors.
But I am still not able to checkpoint. When I run ompi-checkpoint with the
PID of mpirun, it prints nothing and never returns (not even a new prompt).
In my mca-params.conf file, I added

sstore=stage

sstore_stage_local_snapshot_dir=/tmp/ndesai/local
sstore_base_global_snapshot_dir=/tmp/ndesai/global

I created the local and global folders myself.
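
One thing worth ruling out is a snapshot directory that is missing or not
writable. Below is a minimal sketch, assuming a POSIX shell and directory
paths without spaces. check_snapshot_dirs is a hypothetical helper name, not
an Open MPI tool, and the file you pass in is whichever mca-params.conf you
edited. It only checks the directories; it does not by itself explain why
ompi-checkpoint hangs.

```shell
#!/bin/sh
# Sketch: confirm that every "*_snapshot_dir" path named in an MCA
# parameter file exists and is writable by the current user.
# check_snapshot_dirs is a hypothetical helper, not part of Open MPI.

check_snapshot_dirs() {
    conf_file=$1
    status=0
    # Keep only "<name>_snapshot_dir = <path>" lines and print the path.
    # Comment lines (starting with '#') and other parameters are skipped.
    for dir in $(sed -n 's/^[[:space:]]*[A-Za-z_]*_snapshot_dir[[:space:]]*=[[:space:]]*//p' "$conf_file"); do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing or not writable: $dir"
            status=1
        fi
    done
    # Non-zero if at least one directory failed the check.
    return $status
}
```

Calling check_snapshot_dirs with the path to the parameter file prints one
line per directory and returns non-zero if any of them fails the check.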

I am running all the processes on a single machine.
What am I doing wrong? Please guide me.

Thanks,
Neel.

On Mon, Jun 3, 2013 at 9:34 AM, Neel Sunil Desai <Neel.Desai_at_[hidden]> wrote:

> Hi Ralph.
>
> I checked the errors.
> I do not understand what the following means: The session directory
> location could not be parsed.
>
> ompi-checkpoint attempted to use the session directory:
> /tmp/openmpi-sessions-ndesai_at_vcainternmpi01_0
> I opened the /tmp/openmpi-sessions-ndesai directory and various
> directories are created.
>
> Also, when I run the mpi program, I get the following errors before the
> program starts running correctly:
>
> [ndesai_at_vcainternmpi01 work]$ mpirun -am ft-enable-cr --np 16
> ./DecoderTest ../../decoder/test.ini
> [vcainternmpi01:25341] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25342] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25343] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25344] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25347] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25354] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25356] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25337] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25338] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25339] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25340] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25355] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25359] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25357] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25358] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25362] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
>
> I also checked the mca-params.conf file and all it contained were
> comments. Do I have to make any changes there to get correct snapshots?
>
> Thanks a lot,
> Neel.
>
> On Fri, May 31, 2013 at 5:24 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Did you check the items on the list given in the error? I'm no expert on
>> ompi-checkpoint, but the error means that one of those conditions isn't
>> being met.
>>
>>
>> On May 31, 2013, at 4:54 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>> wrote:
>>
>> Hi Ralph,
>>
>> Thanks for the help. PATH and LD_LIBRARY_PATH were not set to the correct
>> location. I was able to execute the ompi-checkpoint command, but I got the
>> following error.
>>
>> [ndesai_at_vcainternmpi01 ~]$ ompi-checkpoint 1803
>> --------------------------------------------------------------------------
>> Error: Unable to find the requested, active MPIRUN process on this
>> machine.
>> This could be due to one of the following:
>> - The jobid specified by the '--hnp-jobid' option is not
>> correct.
>> - The PID specified (1803) is not that of an active MPIRUN.
>> - The application with this PID is not checkpointable
>> - The application with this PID is not an Open MPI application.
>> - The session directory location could not be parsed.
>> ompi-checkpoint attempted to use the session directory:
>> /tmp/openmpi-sessions-ndesai_at_vcainternmpi01_0
>>
>> Thanks,
>> Neel.
>>
>> On Fri, May 31, 2013 at 4:34 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Check that your PATH and LD_LIBRARY_PATH are set to point to the
>>> directory where you installed the version you built (the --prefix=<> you
>>> provided).
>>>
>>> On May 31, 2013, at 4:31 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>>> wrote:
>>>
>>> Hi Ralph,
>>>
>>> I did install Open MPI with the --with-ft=cr option.
>>>
>>> Thanks,
>>> Neel.
>>>
>>> On Fri, May 31, 2013 at 4:25 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> Okay, it should work in that version. It sounds like you didn't
>>>> configure OMPI with the --with-ft=cr option - yes? Take a look at
>>>> "./configure -h" for the ft-related options and ensure you build what you
>>>> need. C/R support is not built by default.
>>>>
>>>>
>>>> On May 31, 2013, at 3:59 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>>>> wrote:
>>>>
>>>> Open MPI 1.5.4
>>>>
>>>> On Fri, May 31, 2013 at 3:31 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> What OMPI version?
>>>>>
>>>>> On May 31, 2013, at 3:17 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I forgot to add. I watched the video of Joshua Hursey and when I
>>>>> > type ompi_info | grep FT, I get FT Checkpoint Support: no (checkpoint
>>>>> > thread: no). I do not get anything when I type ompi_info | grep crs.
>>>>> >
>>>>> > Thanks,
>>>>> > Neel.
>>>>> > _______________________________________________
>>>>> > users mailing list
>>>>> > users_at_[hidden]
>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users