Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open MPI Checkpoint Restart
From: Neel Sunil Desai (Neel.Desai_at_[hidden])
Date: 2013-06-04 11:57:45


Hi,

So, I was able to get rid of the "cannot open shared object file" errors.
But I am still not able to checkpoint. When I run ompi-checkpoint with the
PID of mpirun, it prints nothing and never returns (not even a new prompt).
In my mca-params.conf file, I added

sstore=stage

sstore_stage_local_snapshot_dir=/tmp/ndesai/local
sstore_base_global_snapshot_dir=/tmp/ndesai/global

I created the local and global folders myself.
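
A minimal sketch of that directory setup, assuming a POSIX shell.
prepare_snapshot_dirs is a hypothetical helper, not an Open MPI command, and
the base path in the usage comment is the one from the settings above. It
only confirms that the two directories exist and are writable by the current
user; it says nothing about whether Open MPI itself picks them up.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): create the "local" and
# "global" snapshot directories under a base directory and check that
# the current user can write to each of them.
prepare_snapshot_dirs() {
    base=$1
    for sub in local global; do
        dir="$base/$sub"
        # -p creates missing parents and does not fail if the dir exists
        mkdir -p "$dir" || return 1
        if [ ! -w "$dir" ]; then
            echo "not writable: $dir" >&2
            return 1
        fi
        echo "ok: $dir"
    done
}

# Example with the base path used in this thread:
# prepare_snapshot_dirs /tmp/ndesai
```

Running the commented example should leave /tmp/ndesai/local and
/tmp/ndesai/global in place, matching the two sstore_* paths.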

I am running all the processes on a single machine.
What am I doing wrong? Please guide me.

Thanks,
Neel.

On Mon, Jun 3, 2013 at 9:34 AM, Neel Sunil Desai <Neel.Desai_at_[hidden]> wrote:

> Hi Ralph.
>
> I checked the errors.
> I do not understand what the following means: The session directory
> location could not be parsed.
>
> ompi-checkpoint attempted to use the session directory:
> /tmp/openmpi-sessions-ndesai_at_vcainternmpi01_0
> I opened the /tmp/openmpi-sessions-ndesai directory and found that
> various directories had been created there.
>
> Also, when I run the mpi program, I get the following errors before the
> program starts running correctly:
>
> [ndesai_at_vcainternmpi01 work]$ mpirun -am ft-enable-cr --np 16
> ./DecoderTest ../../decoder/test.ini
> [vcainternmpi01:25341] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25342] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25343] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25344] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25347] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25354] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25356] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25337] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25338] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25339] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25340] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25355] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25359] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25357] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25358] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
> [vcainternmpi01:25362] mca: base: component_find: unable to open
> /home/ndesai/mpicr/lib/openmpi/mca_crs_blcr: libcr.so.0: cannot open shared
> object file: No such file or directory (ignored)
>
> I also checked the mca-params.conf file and all it contained were
> comments. Do I have to make any changes there to get correct snapshots?
>
> Thanks a lot,
> Neel.
>
> On Fri, May 31, 2013 at 5:24 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Did you check the items on the list given in the error? I'm no expert on
>> ompi-checkpoint, but the error means that one of those conditions isn't
>> being met.
>>
>>
>> On May 31, 2013, at 4:54 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>> wrote:
>>
>> Hi Ralph,
>>
>> Thanks for the help. PATH and LD_LIBRARY_PATH were not set to the correct
>> location. I was able to execute the ompi-checkpoint command, but I got the
>> following error.
>>
>> [ndesai_at_vcainternmpi01 ~]$ ompi-checkpoint 1803
>> --------------------------------------------------------------------------
>> Error: Unable to find the requested, active MPIRUN process on this
>> machine.
>> This could be due to one of the following:
>> - The jobid specified by the '--hnp-jobid' option is not
>> correct.
>> - The PID specified (1803) is not that of an active MPIRUN.
>> - The application with this PID is not checkpointable
>> - The application with this PID is not an Open MPI application.
>> - The session directory location could not be parsed.
>> ompi-checkpoint attempted to use the session directory:
>> /tmp/openmpi-sessions-ndesai_at_vcainternmpi01_0
>> Thanks,
>> Neel.
>>
>> On Fri, May 31, 2013 at 4:34 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Check that your PATH and LD_LIBRARY_PATH are set to point to the
>>> directory where you installed the version you built (the --prefix=<> you
>>> provided).
>>>
>>> On May 31, 2013, at 4:31 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>>> wrote:
>>>
>>> Hi Ralph,
>>>
>>> I did install Open MPI with the --with-ft=cr option.
>>>
>>> Thanks,
>>> Neel.
>>>
>>> On Fri, May 31, 2013 at 4:25 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> Okay, it should work in that version. It sounds like you didn't
>>>> configure OMPI with the --with-ft=cr option - yes? Take a look at
>>>> "./configure -h" for the ft-related options and ensure you build what you
>>>> need. C/R support is not built by default.
>>>>
>>>>
>>>> On May 31, 2013, at 3:59 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>>>> wrote:
>>>>
>>>> Open MPI 1.5.4
>>>>
>>>> On Fri, May 31, 2013 at 3:31 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> What OMPI version?
>>>>>
>>>>> On May 31, 2013, at 3:17 PM, Neel Sunil Desai <Neel.Desai_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I forgot to add: I watched Joshua Hursey's video, and when I type
>>>>> > ompi_info | grep FT, I get "FT Checkpoint Support: no (checkpoint
>>>>> > thread: no)". I get no output when I type ompi_info | grep crs.
>>>>> >
>>>>> > Thanks,
>>>>> > Neel.
>>>>> > _______________________________________________
>>>>> > users mailing list
>>>>> > users_at_[hidden]
>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>