Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Sharon Brunett (sharon_at_[hidden])
Date: 2008-04-23 17:28:27


Josh Hursey wrote:
> On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:
>
>> Hello,
>> I'm using openmpi-1.3a1r18241 on a 2 node configuration and having
>> troubles with the ompi-restart. I can successfully ompi-checkpoint
>> and ompi-restart a 1 way mpi code.
>> When I try a 2 way job running across 2 nodes, I get
>>
>> bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
>> [shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
>> [shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
>> [shc005:01159] Exec in self
>> Restart failed: Permission denied
>> Restart failed: Permission denied
>>
>
> This error is coming from BLCR. A few things to check.
>
> First take a look at /var/log/messages on the machine(s) you are
> trying to restart on. Per:
> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm
>
> Next check to make sure prelinking is turned off on the two machines
> you are using. Per:
> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
>
> Those will rule out some common BLCR problems. (more below)
>
>>
>> If I try running as root, using the same snapshot file, the code
>> restarts ok, but both tasks end up on the same node, rather than one
>> per node (like the original mpirun).
>
> You should never have to run as root to restart a process (or to run
> Open MPI in any form). So I'm wondering if your user has permissions
> to access the checkpoint files that BLCR is generating. You can look
> at the permissions for the individual checkpoint files by looking into
> the checkpoint handler directory. They are a bit hidden, so something
> like the following should expose them:
> -------------------
> shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
> total 1756
> drwx------ 2 sharon users 4096 Apr 23 16:29 .
> drwx------ 4 sharon users 4096 Apr 23 16:29 ..
> -rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
> -rw-r--r-- 1 sharon users 35 Apr 23 16:29 snapshot_meta.data
> shell$
> shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
> total 1756
> drwx------ 2 sharon users 4096 Apr 23 16:29 .
> drwx------ 4 sharon users 4096 Apr 23 16:29 ..
> -rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
> -rw-r--r-- 1 sharon users 35 Apr 23 16:29 snapshot_meta.data
> -------------------
>
> The BLCR-generated context files are "ompi_blcr_context.PID", and you
> need to check that you have sufficient permissions to access those
> files (something like above).
>
>>
>> I'm using BLCR version 0.6.5.
>> I generate checkpoints via 'ompi-checkpoint pid'
>> where pid is the pid of the mpirun task below
>>
>> mpirun -np 2 -am ft-enable-cr ./xhpl
>>
>
> Are you running in a managed environment (e.g., using Torque or
> Slurm)? Odds are once you switched to root you lost the environment
> variables for your allocation (which is how Open MPI detects when to
> use an allocation). This would explain why the processes were
> restarted on one node instead of two.
>
> ompi-restart uses mpirun underneath to do the process launch in
> exactly the same way as a normal mpirun, so the mapping of processes
> should be the same. That being said, there is a bug that I'm tracking
> in which it is not. This bug has nothing to do with restarting
> processes, and more to do with a bookkeeping error when using app files.
>
>
>> Thanks very much for any hints you can give on how to resolve either
>> of these problems.
>
> Let me know if this helps solve the problem. If not we might be able
> to try some other things.
>
> Cheers,
> Josh
>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
Thanks much.
vmadump: open('/var/run/nscd/passwd', 0x0) failed: -13
vmadump: mmap failed: /var/run/nscd/passwd

is indeed the problem, as shown by dmesg.
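
For anyone hitting the same error: the -13 in the vmadump line is a negated errno, and 13 is EACCES (permission denied) on Linux, which matches the "Permission denied" printed by ompi-restart. Below is a minimal shell sketch, not part of Open MPI or BLCR, for pulling the denied paths out of the kernel log and for listing the BLCR context files with their permissions. The function names denied_paths and list_context_files are hypothetical, and the sketch assumes the log lines have the same shape as the two vmadump lines above.

```shell
# Hypothetical helper: read kernel log text on stdin and print the path of
# every vmadump open() that failed with -13 (EACCES).
# Assumes lines shaped like: vmadump: open('/some/path', 0x0) failed: -13
denied_paths() {
    sed -n "s/.*vmadump: open('\([^']*\)'.*failed: -13.*/\1/p"
}

# Hypothetical helper: long-list every BLCR context file
# (ompi_blcr_context.PID) found under a global snapshot directory,
# so owner and mode bits can be checked by eye.
list_context_files() {
    find "$1" -type f -name 'ompi_blcr_context.*' -exec ls -l {} +
}

# Example use (commented out so that sourcing this file has no side effects):
#   dmesg | denied_paths | sort -u
#   list_context_files /home/sharon/ompi_global_snapshot_926.ckpt
```

Running ls -l on each path printed by denied_paths shows whether the restarting user can read it; in this thread the path was /var/run/nscd/passwd rather than one of the checkpoint files.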