1) Josh, the main developer for checkpoint/restart, has been away for
a few weeks and has just returned. I suspect he will get unburied from
e-mail in another day or two.
2) The 1.4 (and 1.3) branches are under rapid development, and there
will be times when basic functionality just breaks for a day or so. If
you run into a problem, please be more specific about which version
(including the r#) you tried.
3) The checkpoint/restart functionality currently only supports a
subset of the network transports. I think all that you should expect
to work right now is TCP and shared memory.
Josh is working on other transports, but those are very much a "work
On Wed, Aug 20, 2008 at 4:11 AM, Matthias Hovestadt
> Hi Gabriele!
>> In this case, mpirun works well, but the checkpoint procedure fails:
>> ompi-checkpoint 20109
>> [node0316:20134] Error: Unable to get the current working directory
>> [node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
>> [node0316:20134] HNP with PID 20109 Not found!
> I had exactly the same problem on my machine. Neither modifying
> the configure parameters nor changing the way I invoked the
> ompi-checkpoint command helped. Since I am using the source from a
> Subversion checkout, I also updated the source several times,
> following the day-to-day progress. However, the problem remained.
> Luckily, updating the source to SVN revision 19265 finally solved
> this checkpointing issue. Maybe the problem shows up again in later
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmattox_at_[hidden] || timattox_at_[hidden]
I'm a bright... http://www.the-brights.net/