Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-23 16:48:27


On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:

> Hello,
> I'm using openmpi-1.3a1r18241 on a 2 node configuration and having
> troubles with the ompi-restart. I can successfully ompi-checkpoint
> and ompi-restart a 1 way mpi code.
> When I try a 2 way job running across 2 nodes, I get
>
> bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
> [shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
> [shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
> [shc005:01159] Exec in self
> Restart failed: Permission denied
> Restart failed: Permission denied
>

This error is coming from BLCR. A few things to check.

First, take a look at /var/log/messages on the machine(s) you are
trying to restart on, per:
  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm

Next, check that prelinking is turned off on the two machines
you are using, per:
  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

Those will rule out some common BLCR problems. (more below)
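
A minimal sketch of the prelink check, assuming a Red Hat style system where the setting lives in /etc/sysconfig/prelink as a PRELINKING=yes/no line (other distributions may keep it elsewhere). The function name prelink_enabled is made up for illustration; it is not a BLCR or Open MPI tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the prelink config file exists
# and enables prelinking. The default path is an assumption that holds on
# Red Hat style systems; pass another path as $1 if yours differs.
prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    [ -f "$conf" ] && grep -Eq '^[[:space:]]*PRELINKING=yes' "$conf"
}
```

Run it on each node, for example `prelink_enabled && echo "prelinking is on"`. A missing config file is treated the same as prelinking being off.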

>
>
> If I try running as root, using the same snapshot file, the code
> restarts ok, but both tasks end up on the same node, rather than one
> per node (like the original mpirun).

You should never have to run as root to restart a process (or to run
Open MPI in any form). So I'm wondering if your user has permissions
to access the checkpoint files that BLCR is generating. You can look
at the permissions for the individual checkpoint files by looking into
the checkpoint handler directory. They are a bit hidden, so something
like the following should expose them:
-------------------
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
total 1756
drwx------ 2 sharon users 4096 Apr 23 16:29 .
drwx------ 4 sharon users 4096 Apr 23 16:29 ..
-rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
-rw-r--r-- 1 sharon users 35 Apr 23 16:29 snapshot_meta.data
shell$
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
total 1756
drwx------ 2 sharon users 4096 Apr 23 16:29 .
drwx------ 4 sharon users 4096 Apr 23 16:29 ..
-rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
-rw-r--r-- 1 sharon users 35 Apr 23 16:29 snapshot_meta.data
-------------------

The BLCR-generated context files are named "ompi_blcr_context.PID", and
you need to make sure that the user doing the restart has sufficient
permissions to access those files (something like the listing above).
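
If you would rather not inspect each directory by hand, here is a rough sketch that walks a global snapshot directory and reports any context file the current user cannot read. It assumes the directory layout shown in the listing above; the function name check_ckpt_perms is hypothetical and not something Open MPI or BLCR provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or BLCR): report BLCR context
# files under a global snapshot directory that the current user cannot read.
# Returns 0 if all are readable, 1 if any is not, 2 if none are visible.
check_ckpt_perms() {
    snapshot_dir="$1"
    found=0
    status=0
    # Layout assumed from the listing above:
    #   <snapshot>/<seq>/opal_snapshot_<rank>.ckpt/ompi_blcr_context.<PID>
    for f in "$snapshot_dir"/*/opal_snapshot_*.ckpt/ompi_blcr_context.*; do
        [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
        found=1
        if [ ! -r "$f" ]; then
            echo "not readable: $f"
            status=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        # Also happens when the parent directories cannot be traversed.
        echo "no context files visible under $snapshot_dir"
        status=2
    fi
    return "$status"
}
```

For the snapshot in this thread that would be `check_ckpt_perms /home/sharon/ompi_global_snapshot_926.ckpt`, run as the user doing the restart and on each node involved. Note that root can read everything, so running it as root tells you nothing.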

>
>
> I'm using BLCR version 0.6.5.
> I generate checkpoints via 'ompi-checkpoint pid'
> where pid is the pid of the mpirun task below
>
> mpirun -np 2 -am ft-enable-cr ./xhpl
>

Are you running in a managed environment (e.g., using Torque or
Slurm)? Odds are that once you switched to root you lost the environment
variables for your allocation (which is how Open MPI detects when to use
an allocation). This would explain why the processes were restarted on
one node instead of two.
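
One quick way to test that theory is to compare what the resource manager exported in your normal shell against what is left after switching to root. The sketch below assumes Torque variables start with PBS_ and Slurm variables start with SLURM_; the function name show_allocation_env is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print any Torque (PBS_*) or Slurm (SLURM_*) variables
# visible in the current environment, or a note if there are none.
show_allocation_env() {
    env | grep -E '^(PBS_|SLURM_)' || echo "no Torque/Slurm variables found"
}
```

Call `show_allocation_env` once as your own user inside the allocation and once as root. If the second run reports nothing, the root shell no longer sees the allocation.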

ompi-restart uses mpirun underneath to do the process launch in
exactly the same way as a normal mpirun, so the mapping of processes
should be the same. That being said, there is a bug that I'm tracking
in which the mapping is not preserved. This bug has nothing to do with
restarting processes; it is a bookkeeping error when using app files.

>
> Thanks very much for any hints you can give on how to resolve either
> of these problems.

Let me know if this helps solve the problem. If not, we might be able
to try some other things.

Cheers,
Josh
