Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Error while restarting a checkpoint
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-10-31 10:49:35


My suspicions were confirmed. After calling orte_iof_base_setup_child/parent,
the problem no longer occurs.
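
For anyone hitting the same symptom outside of ORTE, the underlying idea can be sketched with plain POSIX calls: before checkpointing or restarting, make sure descriptors 0, 1 and 2 actually refer to something. This is only an illustration under that assumption, not what orte_iof_base_setup_child/parent does internally; the helper name ensure_stdio_valid is hypothetical, and /dev/null is just a convenient placeholder target.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI or BLCR): make sure that
 * descriptors 0, 1 and 2 are open, attaching /dev/null to any that is not.
 * Returns 0 on success, -1 on error. */
int ensure_stdio_valid(void)
{
    for (int target = 0; target <= 2; ++target) {
        /* fcntl(F_GETFD) succeeds only if the descriptor is open. */
        if (fcntl(target, F_GETFD) != -1) {
            continue;               /* already valid, leave it alone */
        }
        if (errno != EBADF) {
            return -1;              /* unexpected failure */
        }

        int fd = open("/dev/null", O_RDWR);
        if (fd < 0) {
            return -1;
        }
        if (fd != target) {
            /* open() picked another slot; duplicate onto the target. */
            if (dup2(fd, target) < 0) {
                close(fd);
                return -1;
            }
            close(fd);
        }
    }
    return 0;
}
```

A process would call ensure_stdio_valid() before requesting the checkpoint, or early after the restart. Whether this alone avoids the BLCR failure was not tested here; the fix that worked for me was the IOF setup mentioned above.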

Leonardo

Leonardo Fialho wrote:
> Hi All,
>
> I'm trying to restart a process from a previous checkpoint. My
> (modified) orted does this using the opal-restart command, but after
> the CRS calls cr_restart (crs:blcr:
> blcr_restart: SELF: exec :(cr_restart, cr_restart
> /tmp/radic//1/ompi_blcr_context.6507)) the OS freezes (kernel panic).
> The error generated at that moment is:
>
> "Restart failed: No such device or address"
>
> I think this may happen because stdin/stdout/stderr in the
> checkpointed file point to undefined descriptors, or something like
> that...
>
> Can anybody help with this? How can I close these descriptors before
> the checkpoint? Does opal-restart open these descriptors too? What can
> I do to make it work?
>
> Thanks,

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478