Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpoint problem
From: Gabriele Fatigati (g.fatigati_at_[hidden])
Date: 2008-08-23 10:57:35


Well,
as you suggested, I've installed the latest Open MPI nightly build,
version 1.4a1r19370.

Now the checkpoint procedure works well, and the related restart files are
created correctly, but the process restart fails. After the restart command,
the process starts, but remains frozen doing nothing, and then dies.

I'm working with TCP over 4 procs.
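
The sequence being described is roughly the following. This is only a sketch
under stated assumptions: the application name ./my_app is a placeholder, and
the "-am ft-enable-cr" option, the ompi-restart command and the snapshot name
ompi_global_snapshot_<PID>.ckpt are assumed from the checkpoint/restart
documentation for this development series, so they may differ in a given
nightly build. Only ompi-checkpoint <PID>, TCP and the 4 processes come from
this thread.

```shell
#!/bin/sh
# Sketch of one checkpoint/restart cycle over TCP with 4 processes.
# Assumptions (not taken from the thread): the application name, the
# "-am ft-enable-cr" option, the ompi-restart command, and the default
# snapshot name ompi_global_snapshot_<PID>.ckpt.

APP=${APP:-./my_app}   # hypothetical application binary
NP=4                   # "4 procs", as in the message

# Assumed default name of the global snapshot for a given mpirun PID.
snapshot_ref() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

run_cycle() {
    # 1. Start the job with checkpoint/restart enabled, restricted to TCP.
    mpirun -np "$NP" -am ft-enable-cr --mca btl tcp,self "$APP" &
    mpirun_pid=$!
    sleep 10   # arbitrary pause so the job is running before the checkpoint

    # 2. Checkpoint the job by passing the PID of mpirun.
    ompi-checkpoint "$mpirun_pid" || return 1

    # 3. Stop the original job, then restart it from the snapshot.
    kill "$mpirun_pid"
    wait "$mpirun_pid" 2>/dev/null
    ompi-restart "$(snapshot_ref "$mpirun_pid")"
}

# Nothing runs on its own: call run_cycle only on a machine with an
# Open MPI build configured for checkpoint/restart.
```

When reporting a hang like this one, it helps to state which of the three
steps was the last one to return.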

Any idea?

2008/8/20 Tim Mattox <timattox_at_[hidden]>

> Hello,
> Three things...
> 1) Josh, the main developer for checkpoint/restart, has been away for a few
> weeks and has just returned. I suspect he will get unburied from e-mail in
> another day or two.
>
> 2) The 1.4 (and 1.3) branch is very much under rapid development, and there
> will be times when basic functionality will just break for a day or so. If
> you run into a problem, please try to be more specific about which version
> (including the r#) you tried.
>
> 3) The checkpoint/restart functionality currently only supports a subset of
> the network transports. I think all that you should expect to work right now
> is TCP and shared memory. Josh is working on other transports, but those are
> very much a "work in progress".
>
> On Wed, Aug 20, 2008 at 4:11 AM, Matthias Hovestadt
> <maho_at_[hidden]> wrote:
> > Hi Gabriele!
> >
> >> In this case, mpirun works well, but the checkpoint procedure fails:
> >>
> >> ompi-checkpoint 20109
> >> [node0316:20134] Error: Unable to get the current working directory
> >> [node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
> >> [node0316:20134] HNP with PID 20109 Not found!
> >
> > I had exactly the same problem on my machine. Neither modifying
> > the configure parameters nor changing the way I invoked the
> > ompi-checkpoint command helped. Since I am using the source from a
> > Subversion checkout, I also updated the source several times, following
> > the day-to-day progress. However, the problem remained.
> >
> > Luckily, updating the source to SVN revision 19265 finally solved
> > this checkpointing issue. Maybe the problem shows up again in later
> > versions...
> >
> >
> > Best,
> > Matthias
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
> --
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> tmattox_at_[hidden] || timattox_at_[hidden]
> I'm a bright... http://www.the-brights.net/

-- 
Gabriele Fatigati
CINECA Systems & Technologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it Tel: +39 051 6171722
g.fatigati_at_[hidden]