Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Sharon Brunett (sharon_at_[hidden])
Date: 2008-04-29 13:09:07


Josh,
Thanks for your input.
Yes, I'm able to restart properly now that the hostfile issues are out
of the way. The restart failures were caused by the permissions on
   /var/run/nscd/passwd
The hostfile issues have also been resolved: the problem was the
interaction between maui/torque's hostfile and getting a proper
hostfile passed to mpirun.
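
For reference, the check that exposed it was just the file permissions,
e.g.

   ls -l /var/run/nscd/passwd

As I understand it, nscd maps that file into client processes, so BLCR
needs it to be readable by the user doing the restart.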

thanks for your help!
Sharon

Josh Hursey wrote:
> On Apr 25, 2008, at 6:12 PM, Sharon Brunett wrote:
>
>> Josh,
>> I'm responding to some outstanding questions about the environment
>> I'm trying to ompi-restart in.
>> My answers to your questions are sprinkled below, and include a few
>> more questions based on attempts I've made to get a multi-node
>> restart working.
>>
>> thanks,
>> Sharon
>>
>> Sharon Brunett wrote:
>>> Josh Hursey wrote:
>>>> On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:
>>>>
>>>>> Hello,
>>>>> I'm using openmpi-1.3a1r18241 on a 2-node configuration and having
>>>>> trouble with ompi-restart. I can successfully ompi-checkpoint
>>>>> and ompi-restart a 1-way MPI code.
>>>>> When I try a 2-way job running across 2 nodes, I get
>>>>>
>>>>> bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
>>>>> [shc005:01159] Checking for the existence of (/home/sharon/
>>>>> ompi_global_snapshot_926.ckpt)
>>>>> [shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
>>>>> [shc005:01159] Exec in self
>>>>> Restart failed: Permission denied
>>>>> Restart failed: Permission denied
>>>>>
>>>> This error is coming from BLCR. A few things to check.
>>>>
>>>> First, take a look at /var/log/messages on the machine(s) you are
>>>> trying to restart on. Per:
>>>> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm
>>>>
>>>> Next, check that prelinking is turned off on the two machines
>>>> you are using. Per:
>>>> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
>>>>
>>>> Those will rule out some common BLCR problems. (more below)
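>>>>
>>>> For example, something like the following (paths assume a typical
>>>> RHEL-style Linux install; adjust for your distribution):
>>>> -------------------
>>>> shell$ tail -n 50 /var/log/messages              # look for BLCR/cr_restart errors
>>>> shell$ grep PRELINKING /etc/sysconfig/prelink    # should say PRELINKING=no
>>>> shell$ sudo prelink -ua                          # undo any existing prelinking
>>>> -------------------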
>>>>
>>>>> If I try running as root, using the same snapshot file, the code
>>>>> restarts ok, but both tasks end up on the same node, rather than
>>>>> one per node (like the original mpirun).
>>>> You should never have to run as root to restart a process (or to run
>>>> Open MPI in any form). So I'm wondering if your user has permission
>>>> to access the checkpoint files that BLCR is generating. You can check
>>>> the permissions on the individual checkpoint files by looking in the
>>>> checkpoint handler directory. They are a bit hidden, so something
>>>> like the following should expose them:
>>>> -------------------
>>>> shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
>>>> total 1756
>>>> drwx------ 2 sharon users    4096 Apr 23 16:29 .
>>>> drwx------ 4 sharon users    4096 Apr 23 16:29 ..
>>>> -rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
>>>> -rw-r--r-- 1 sharon users      35 Apr 23 16:29 snapshot_meta.data
>>>> shell$
>>>> shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
>>>> total 1756
>>>> drwx------ 2 sharon users    4096 Apr 23 16:29 .
>>>> drwx------ 4 sharon users    4096 Apr 23 16:29 ..
>>>> -rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
>>>> -rw-r--r-- 1 sharon users      35 Apr 23 16:29 snapshot_meta.data
>>>> -------------------
>>>>
>>>> The BLCR-generated context files are "ompi_blcr_context.PID", and
>>>> you need to check that you have sufficient permissions to access
>>>> those files (something like the above).
>>>>
>>>>> I'm using BLCR version 0.6.5.
>>>>> I generate checkpoints via 'ompi-checkpoint pid',
>>>>> where pid is the PID of the mpirun task below:
>>>>>
>>>>> mpirun -np 2 -am ft-enable-cr ./xhpl
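>>>>>
>>>>> So the full cycle looks roughly like this (926 was the PID of
>>>>> mpirun in my case):
>>>>>
>>>>> mpirun -np 2 -am ft-enable-cr ./xhpl &
>>>>> ompi-checkpoint 926
>>>>> ompi-restart ompi_global_snapshot_926.ckpt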
>>>>>
>>>> Are you running in a managed environment (e.g., using Torque or
>>>> Slurm)? Odds are that once you switched to root you lost the
>>>> environment variables for your allocation (which is how Open MPI
>>>> detects when to use an allocation). This would explain why the
>>>> processes were restarted on one node instead of two.
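>>>>
>>>> A quick way to confirm (variable names here assume Torque/PBS):
>>>> compare the allocation-related environment as your user versus as
>>>> root:
>>>> -------------------
>>>> shell$ env | grep PBS          # as your user: PBS_NODEFILE etc.
>>>> shell$ sudo env | grep PBS     # as root: likely empty
>>>> -------------------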
>>>>
>> Maui/torque is the scheduler/resource manager combo being used. I
>> have been trying, to no avail, to pass a machinefile (listing the
>> hostnames of the nodes given to me by maui/torque) to ompi-restart
>> so it can in turn pass it on to mpirun. Any suggestions on how to
>> do this? --verbose passed to ompi-restart isn't very verbose about
>> what's going on.
>>
>
> If you pass '--help' to ompi-restart it will show you all the command
> line options for that command (following UNIX convention). To pass a
> hostfile to ompi-restart, just use either the --hostfile or
> --machinefile option the same way you would with orterun. ompi-restart
> will pass it along to the orterun it starts up.
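>
> For example (the hostfile name is illustrative; under Torque the node
> list lives in $PBS_NODEFILE):
> -------------------
> shell$ cat $PBS_NODEFILE > my_hosts
> shell$ ompi-restart --hostfile my_hosts ompi_global_snapshot_926.ckpt
> -------------------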
>
> There is one bug I'm trying to track down at the moment with app
> context files. In the current trunk, processes are not being mapped
> quite as consistently as they should be. You may be running into this
> problem, but I can't say for sure at the moment.
>
>
>>>> ompi-restart uses mpirun underneath to do the process launch in
>>>> exactly the same way as a normal mpirun, so the mapping of processes
>>>> should be the same. That being said, there is a bug I'm tracking in
>>>> which they are not. The bug has nothing to do with restarting
>>>> processes; it is a bookkeeping error when using app files.
>>>>
>>>>
>> Right, I doubt that bug has anything to do with my basic problem:
>> the MPI tasks launch only on the node mpirun is sitting on rather
>> than across the 2 nodes.
>
> Did you check the permissions of the resulting checkpoint files to
> make sure that you have the proper access to them?
>
> So I am a little confused: are you now able to restart properly,
> outside of the hostfile issue described above?
>
> -- Josh
>
>>
>> Thanks,
>> Sharon
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>