Subject: Re: [OMPI users] Checkpoint/Restart error
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-01-25 11:55:11


I tested the 1.4.1 release, and everything worked fine for me (tested
a few different configurations of nodes/environments).

The ompi-checkpoint error you cited is usually caused by one of two
things:
  - The PID specified is wrong (which I don't think is the case
here)
  - The session directory cannot be found in /tmp.

So I think the problem is the latter. The session directory looks
something like:
   /tmp/openmpi-sessions-USERNAME@LOCALHOST_0
Within this directory the mpirun process places its contact
information. ompi-checkpoint uses this contact information to connect
to the job. If it cannot find it, then it errors out. (We definitely
need a better error message here. I filed a ticket [1]).

We do not recommend running Open MPI as root, so I would strongly
suggest that you run as a regular user instead.

With a regular user, check the location of the session directory. Make
sure that it is in /tmp on the node where 'mpirun' and
'ompi-checkpoint' are run.
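
A minimal sketch of that check, run on the node where 'mpirun' was
started. The helper name find_session_dirs is hypothetical, and the
code assumes only that the directory name begins with
"openmpi-sessions-" followed by the user name, as in the example path
above. It does not inspect the contact information inside the
directory.

```python
import getpass
import glob
import os


def find_session_dirs(tmp_root="/tmp", user=None):
    """Return session directories under tmp_root that belong to user.

    Assumption: names start with "openmpi-sessions-<user>", following
    the example path in this message.
    """
    user = user or getpass.getuser()
    # Escape the user name so glob metacharacters in it match literally.
    pattern = os.path.join(
        tmp_root, "openmpi-sessions-" + glob.escape(user) + "*"
    )
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    dirs = find_session_dirs()
    if dirs:
        for d in dirs:
            print("found session directory:", d)
    else:
        print("no session directory for this user under /tmp")
```

If this prints no directory while 'mpirun' is running as the same
user, that would be consistent with the "Not found" error from
ompi-checkpoint.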

-- Josh

[1] https://svn.open-mpi.org/trac/ompi/ticket/2189

On Jan 25, 2010, at 5:48 AM, Andreea Costea wrote:

> So? anyone? any clue?
>
> To summarize:
> - installed OpenMPI 1.4.1 on a fresh CentOS 5
> - mpirun works but ompi-checkpoint throws this error:
> ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
> - on another VM I have OpenMPI 1.3.3 installed. Checkpointing works
> fine as guest but gives the previously mentioned error as root. Both
> root and guest show the same output for "ompi_info --param all all"
> except for the $HOME (which only matters for mca_component_path,
> mca_param_files, snapc_base_global_snapshot_dir)
>
>
> Thanks,
> Andreea
>
>
> On Tue, Jan 19, 2010 at 9:01 PM, Andreea Costea <andre.costea_at_[hidden]
> > wrote:
> I noticed one more thing. As I still have some VMs that have OpenMPI
> version 1.3.3 installed, I started to use those machines until I fix
> the problem with 1.4.1. While checkpointing on one of these VMs I
> realized that checkpointing as a guest works fine and checkpointing
> as root outputs the same error as in 1.4.1: ORTE_ERROR_LOG:
> Not found in file orte-checkpoint.c at line 405
>
> I logged the outputs of "ompi_info --param all all", which I ran for
> root and for another user, and the only differences were in these
> parameters:
>
> mca_component_path
> mca_param_files
> snapc_base_global_snapshot_dir
>
> All 3 params differ because of the $HOME.
> One more thing: I don't have the directory $HOME/.openmpi
>
> Ideas?
>
> Thanks,
> Andreea
>
>
>
>
>
> On Tue, Jan 19, 2010 at 12:51 PM, Andreea Costea <andre.costea_at_[hidden]
> > wrote:
> Well... I decided to install a fresh OS to be sure that there is no
> OpenMPI version conflict. So I formatted one of my VMs, did a fresh
> CentOS install, installed BLCR 0.8.2 and OpenMPI 1.4.1 and the
> result: the same. mpirun works but ompi-checkpoint has that error at
> line 405:
>
> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at
> line 405
>
> As for the files remaining after uninstalling: Jeff, you were right.
> There is no file left, just some empty directories.
>
> What might be causing that ORTE_ERROR_LOG error?
>
> Thanks,
> Andreea
>
> On Fri, Jan 15, 2010 at 11:47 PM, Andreea Costea <andre.costea_at_[hidden]
> > wrote:
> It's almost midnight here, so I left home, but I will try it tomorrow.
> There were some directories left after "make uninstall". I will give
> more details tomorrow.
>
> Thanks Jeff,
> Andreea
>
>
> On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
> On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:
>
> > - I wanted to update to version 1.4.1 and I uninstalled the previous
> version like this: make uninstall, and then manually deleted all the
> leftover files. The directory where I installed was /usr/local
>
> I'll let Josh answer your CR questions, but I did want to ask about
> this point. AFAIK, "make uninstall" removes *all* Open MPI files.
> For example:
>
> -----
> [7:25] $ cd /path/to/my/OMPI/tree
> [7:25] $ make install > /dev/null
> [7:26] $ find /tmp/bogus/ -type f | wc
> 646 646 28082
> [7:26] $ make uninstall > /dev/null
> [7:27] $ find /tmp/bogus/ -type f | wc
> 0 0 0
> [7:27] $
> -----
>
> I realize that some *directories* are left in $prefix, but there
> should be no *files* left. Are you seeing something different?
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users