Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpoint problem
From: Tim Mattox (timattox_at_[hidden])
Date: 2008-08-20 08:48:34

Three things...
1) Josh, the main developer for checkpoint/restart, has been away for a
few weeks and has just returned. I suspect he will get unburied from
e-mail in another day or two.

2) The 1.4 (and 1.3) branches are under rapid development, and there
will be times when basic functionality just breaks for a day or so. If
you run into a problem, please be specific about which version you
tried, including the SVN revision (the r#).
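
A minimal sketch of one way to gather that information, assuming
ompi_info is on your PATH. The helper name ompi_version_lines is made up
for illustration, and the exact label text ompi_info prints may differ
between builds, so the filter is deliberately loose:

```shell
# Hypothetical helper: keep only the version and revision lines from
# ompi_info-style output read on stdin.
ompi_version_lines() {
    grep -i -E 'open mpi:|revision'
}

# Only run the real tool if it is installed.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | ompi_version_lines
fi

# In a Subversion checkout, running `svnversion` at the top of the source
# tree prints the working-copy revision, which is the r# to report.
```

Pasting that output into a problem report, along with the configure line
used, should make it much easier to match the failure to a revision.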

3) The checkpoint/restart functionality currently supports only a subset
of the network transports. I think all you should expect to work right
now is TCP and shared memory. Josh is working on other transports, but
those are very much a "work in progress".
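
A rough sketch of restricting a run to those transports. This is an
assumption about a typical setup, not a tested recipe for any particular
revision: the "self" loopback component is added here because it is
normally needed alongside tcp and sm, and ./my_app, the process count,
and the ft-enable-cr launch option shown in the comments are placeholders
to check against your own build:

```shell
# Limit Open MPI to the TCP, shared-memory, and self (loopback) BTLs.
# An OMPI_MCA_<param> environment variable has the same effect as passing
# "--mca <param> <value>" on the mpirun command line.
export OMPI_MCA_btl=tcp,sm,self

# Then launch with checkpoint/restart enabled and checkpoint by the PID
# of mpirun (run by hand; ./my_app is a placeholder):
#   mpirun -np 4 -am ft-enable-cr ./my_app &
#   ompi-checkpoint <PID of mpirun>
```

If the checkpoint works with this restriction but fails without it, that
points at an unsupported transport rather than at the checkpoint tool.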

On Wed, Aug 20, 2008 at 4:11 AM, Matthias Hovestadt
<maho_at_[hidden]> wrote:
> Hi Gabriele!
>> In this case, mpirun works well, but the checkpoint procedure fails:
>> ompi-checkpoint 20109
>> [node0316:20134] Error: Unable to get the current working directory
>> [node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
>> [node0316:20134] HNP with PID 20109 Not found!
> I had exactly the same problem on my machine. Neither modifying
> the configure parameters nor changing how I invoked the ompi-checkpoint
> command helped. Since I am using the source from a Subversion checkout,
> I also updated the source several times, following the day-to-day
> progress. However, the problem remained.
> Luckily, updating the source to SVN revision 19265 finally solved
> this checkpointing issue. Maybe the problem shows up again in later
> versions...
> Best,
> Matthias

Tim Mattox, Ph.D. -
 tmattox_at_[hidden] || timattox_at_[hidden]
 I'm a bright...