Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-28 20:57:37


I don't think I have ever seen this one before. :(

So are you trying to checkpoint the MPI process by hand, or a non-MPI
process? Can you confirm that you can successfully checkpoint/restart
a non-MPI process on these machines? What version of the Open MPI
trunk are you using? Have you made any changes to the trunk to produce
this build?
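
One way to run the non-MPI sanity check asked about above is to checkpoint a plain process by hand. The sketch below only assembles and launches the command line. The `--pid` and `--file` flags are the ones that appear in the quoted log further down; the helper names, the PID, and the output path are hypothetical, and it assumes Python is available on the nodes.

```python
import shutil
import subprocess


def build_checkpoint_cmd(pid, context_file, binary="cr_checkpoint"):
    """Return the argv list for checkpointing one process with BLCR."""
    return [binary, "--pid", str(pid), "--file", context_file]


def checkpoint(pid, context_file, binary="cr_checkpoint"):
    """Run the checkpoint command and return its exit status.

    Returns None when the binary cannot be found, so a missing or
    unreachable BLCR install is reported instead of raising.
    """
    resolved = shutil.which(binary)
    if resolved is None:
        return None
    cmd = build_checkpoint_cmd(pid, context_file, resolved)
    return subprocess.run(cmd).returncode
```

If `checkpoint()` returns None, the binary was not found at all; a non-zero status means it was found but the checkpoint itself failed.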

Can you send me the info described here (off-list is ok):
  http://www.open-mpi.org/community/help/

-- Josh

On Apr 28, 2008, at 5:10 AM, Leonardo Fialho wrote:

> After changing some parameters (blcr_checkpoint_cmd):
>
> [aogrd01:08552] crs:blcr: checkpoint(8552, ---)
> [aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
> [aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
> [aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec
> :(/softs/blcr-0.6.5/bin/cr_checkpoint,
> /softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
> [aogrd01:08552] crs:blcr: thread_callback()
> [aogrd01:08552] crs:blcr: thread_callback: Continue.
> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with
> status 2
> Checkpoint failed: Bad file descriptor
> chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552':
> No such file or directory
> [aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command
> <chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :
> [256].
> crs:blcr chmod: Resource temporarily unavailable
> [aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the
> checkpoint file (ompi_blcr_context.8552 in the directory
> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
> crs:blcr: checkpoint: Invalid argument
> [aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
>
> BLCR doesn't generate the context file
> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
> checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid
> 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it
> returns the same error: Checkpoint failed: Bad file descriptor
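
Since the context file is never created, one thing that can be ruled out quickly is the target location itself. The sketch below checks only that the parent directory of the context file exists and is writable. The helper name is hypothetical, and the "Bad file descriptor" error may well have a different cause that this check does not cover.

```python
import os


def check_context_target(context_file):
    """Report basic problems with the place a context file would be written.

    Returns a list of human-readable problems; an empty list means the
    parent directory exists and is writable by the current user.
    """
    problems = []
    parent = os.path.dirname(context_file) or "."
    if not os.path.isdir(parent):
        problems.append("missing directory: " + parent)
    elif not os.access(parent, os.W_OK | os.X_OK):
        problems.append("directory not writable: " + parent)
    return problems
```

Calling it with the path from the log, /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552, on the node that runs the process would show whether the snapshot directory was there when the checkpoint was attempted.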
>
> Thanks,
> Leonardo Fialho
>
> Leonardo Fialho wrote:
>> Hi All,
>>
>> Has anybody experienced this error?
>>
>> [aogrdini:09070] Global) Receive a command message from [[13242,0],0].
>> ...
>> [aogrd02:07642] Local) Receive a command message.
>> ...
>> [aogrd01:07938] Local) Receive a command message.
>> ...
>> [aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
>> ...
>> [aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
>> ...
>> [aogrd01:07941] crs:blcr: checkpoint(7941, ---)
>> [aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
>> [aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
>> [aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
>> cr_checkpoint --pid 7941 --file
>> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to
>> execute :(-1):
>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>> ...
>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
>> cr_checkpoint --pid 7645 --file
>> /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to
>> execute :(-1):
>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>> ...
>> [aogrd02:07642] Local) Location: [/tmp/opal_snapshot_1.ckpt]
>>
>> The application stops here and doesn't continue execution. It's
>> using libcr version 0.6.5:
>> $ lsof -p 7518
>> /softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
>>
>> After the orte-checkpoint command, the application process is
>> duplicated on the nodes, like a child of the original process.
>> When I run an application with this version and take a checkpoint
>> manually, I have no problem...
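
The log above reports `execvp returned -1` for a bare `cr_checkpoint`, whereas the follow-up run with the full /softs/blcr-0.6.5/bin/cr_checkpoint path gets past the exec step. execvp resolves a bare command name through the PATH of the calling process, so it may be worth checking whether the BLCR bin directory is on the PATH that the MPI processes see on each node. A minimal sketch of that check follows. The helper name is hypothetical, and this is not how Open MPI itself performs the lookup.

```python
import os
import shutil


def resolve_like_execvp(command, path=None):
    """Look up a command along a PATH string, roughly as execvp would.

    Returns the full path of the first matching executable, or None when
    nothing matches (the case in which execvp fails and returns -1).
    """
    if path is None:
        path = os.environ.get("PATH", os.defpath)
    return shutil.which(command, path=path)
```

Running `resolve_like_execvp("cr_checkpoint")` in the same environment as the application (for example, from a small script launched through mpirun) shows what a bare-name lookup would find there, which can differ from what an interactive login shell finds.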
>>
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edifcio Q, QC/3088
>> http://www.caos
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edifcio Q, QC/3088
> http://www.caos.uab.es
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478