Open MPI User's Mailing List Archives

Subject: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Sharon Brunett (sharon_at_[hidden])
Date: 2008-04-23 16:04:45

I'm using openmpi-1.3a1r18241 on a 2-node configuration and having trouble with ompi-restart. I can successfully ompi-checkpoint and ompi-restart a 1-way MPI code.
When I try a 2-way job running across 2 nodes, I get:

bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
[shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
[shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
[shc005:01159] Exec in self
Restart failed: Permission denied
Restart failed: Permission denied

If I try running as root, using the same snapshot file, the code restarts OK, but both tasks end up on the same node, rather than one per node (as in the original mpirun).

I'm using BLCR version 0.6.5.
I generate checkpoints via 'ompi-checkpoint pid',
where pid is the pid of the mpirun task below:

mpirun -np 2 -am ft-enable-cr ./xhpl

Thanks very much for any hints on how to resolve either of these problems.