Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-04-28 06:10:55


After changing some parameters (blcr_checkpoint_cmd), I now get:

[aogrd01:08552] crs:blcr: checkpoint(8552, ---)
[aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
[aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
[aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec
:(/softs/blcr-0.6.5/bin/cr_checkpoint,
/softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
[aogrd01:08552] crs:blcr: thread_callback()
[aogrd01:08552] crs:blcr: thread_callback: Continue.
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with
status 2
Checkpoint failed: Bad file descriptor
chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552':
No such file or directory
[aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command
<chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :[256].
crs:blcr chmod: Resource temporarily unavailable
[aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the
checkpoint file (ompi_blcr_context.8552 in the directory
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
crs:blcr: checkpoint: Invalid argument
[aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
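
For reference, the difference between this log and the earlier one is how the command name is resolved: the earlier run passed the bare name cr_checkpoint and execvp returned -1, while this run passes an absolute path and the exec goes through. A minimal sketch of that lookup rule follows, assuming a POSIX system; resolve_like_execvp is a hypothetical helper written for illustration and is not part of Open MPI or BLCR.

```python
import os
import shutil


def resolve_like_execvp(cmd, path=None):
    """Return the executable an execvp-style lookup would pick, or None.

    Hypothetical helper for illustration only (not Open MPI or BLCR code).
    """
    # A name containing a slash is used as given, with no PATH search.
    if os.sep in cmd:
        if os.path.isfile(cmd) and os.access(cmd, os.X_OK):
            return cmd
        return None
    # A bare name is searched for in each PATH directory in order.
    return shutil.which(cmd, path=path)


if __name__ == "__main__":
    # Both names are taken from the logs in this thread.
    for name in ("cr_checkpoint", "/softs/blcr-0.6.5/bin/cr_checkpoint"):
        print(name, "->", resolve_like_execvp(name))
```

Running this in the same environment as the MPI processes (not only in an interactive shell) would show whether the bare name can be found there at all.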

BLCR doesn't generate the context file
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid
8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it returns
the same error: Checkpoint failed: Bad file descriptor
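
One simple thing that could be ruled out before the manual run is whether the target directory for the context file exists and is writable by the user running cr_checkpoint. The sketch below checks only that; check_context_path is a hypothetical helper, and passing this check does not by itself explain or exclude the "Bad file descriptor" error.

```python
import os


def check_context_path(context_file):
    """Return a list of filesystem problems for the given context file path.

    Hypothetical diagnostic for illustration only; it inspects the parent
    directory and does not call BLCR.
    """
    problems = []
    parent = os.path.dirname(context_file) or "."
    if not os.path.isdir(parent):
        problems.append("directory missing: " + parent)
    elif not os.access(parent, os.W_OK | os.X_OK):
        # Creating a file needs write and search permission on the directory.
        problems.append("directory not writable: " + parent)
    return problems


if __name__ == "__main__":
    # Path taken from the log above.
    target = "/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552"
    for problem in check_context_path(target) or ["no filesystem problem found"]:
        print(problem)
```

An empty result only means the directory looks usable; the failure would then have to come from somewhere else.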

Thanks,
Leonardo Fialho

Leonardo Fialho wrote:
> Hi All,
>
> Has anybody experienced this error?
>
> [aogrdini:09070] Global) Receive a command message from [[13242,0],0].
> ...
> [aogrd02:07642] Local) Receive a command message.
> ...
> [aogrd01:07938] Local) Receive a command message.
> ...
> [aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
> ...
> [aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
> ...
> [aogrd01:07941] crs:blcr: checkpoint(7941, ---)
> [aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
> [aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
> [aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
> cr_checkpoint --pid 7941 --file
> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to
> execute :(-1):
> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
> ...
> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
> cr_checkpoint --pid 7645 --file
> /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to
> execute :(-1):
> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
> ...
> [aogrd02:07642] Local) Location: [/tmp/opal_snapshot_1.ckpt]
>
> The application stops here and doesn't continue executing. It's
> using libcr version 0.6.5:
> $ lsof -p 7518
> /softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
>
> After the orte-checkpoint command, the application process is duplicated on
> the nodes, as a child of the original process.
> When I run an application with this version and take a checkpoint
> manually, I have no problem...
>
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edifcio Q, QC/3088
> http://www.caos
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478