Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-04-29 08:18:29


Josh,

Last night I made some changes, checked out a fresh SVN version, and
completely redid the BLCR installation. It's working fine now. I suspect
two different causes:

1) cached or stale files (configured with the older BLCR version's path)
in autom4te, configure, or the dependency files;
2) some misconfiguration in the BLCR header files.

When I checkpoint/restart a non-MPI application, it probably uses the
correct libraries, but the BLCR module was probably compiled against
older headers (cached?).

I'm trying to reproduce the error, but before these changes (when it was
not working) BLCR returned the "bad file descriptor" (EBADF) error, and
the blcr module didn't catch this error; it only returned (-1) "child
failed".
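
To illustrate what catching that error could look like, here is a minimal
C sketch. It is not the actual crs:blcr code; run_and_report() is a
hypothetical helper name. It assumes a fork()/execvp() launcher and uses a
close-on-exec pipe so the parent can tell "could not start the command"
(with the errno from execvp) apart from "the command ran and exited
non-zero".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Open MPI code): run argv[0] via execvp() in a
 * child process and wait for it.
 *
 * Returns the child's exit status (0-255) if the program ran and exited.
 * Returns -1 otherwise; *exec_errno then holds the errno of the failing
 * call (pipe, fcntl, fork, execvp or waitpid), or 0 if the child was
 * terminated by a signal.
 */
int run_and_report(char *const argv[], int *exec_errno)
{
    int fds[2];
    int status = 0;
    int child_errno = 0;
    ssize_t n;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL && exec_errno != NULL);
    *exec_errno = 0;

    if (pipe(fds) != 0) {
        *exec_errno = errno;
        return -1;
    }
    /* Close-on-exec: a successful execvp() closes the write end, so the
     * parent reads EOF instead of an errno value. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0) {
        *exec_errno = errno;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        *exec_errno = errno;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: execvp() only returns on failure. */
        close(fds[0]);
        execvp(argv[0], argv);
        child_errno = errno;
        if (write(fds[1], &child_errno, sizeof child_errno) < 0) {
            /* Nothing more the child can do; the exit code still signals failure. */
        }
        _exit(127);
    }

    /* Parent: read the errno sent by the child, if any. */
    close(fds[1]);
    do {
        n = read(fds[0], &child_errno, sizeof child_errno);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            *exec_errno = errno;
            return -1;
        }
    }

    if (n == (ssize_t) sizeof child_errno) {
        *exec_errno = child_errno;   /* execvp() itself failed */
        return -1;
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status);  /* the command ran; this is its exit code */
    }
    return -1;                       /* killed by a signal; *exec_errno stays 0 */
}
```

With something like this, a caller could print strerror(*exec_errno)
next to the "execvp returned -1" message. Note that the helper returns the
decoded exit code, not the raw wait status.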

Thanks,
Leonardo Fialho

Josh Hursey wrote:
> I don't think I have ever seen this one before. :(
>
> So you are trying to checkpoint the MPI process by hand or a non-MPI
> process? Can you confirm that you can successfully checkpoint/restart
> a non-MPI process on these machines? What version of the Open MPI
> trunk are you using? Have you made any changes to the trunk to produce
> this build?
>
> Can you send me the info described here (off-list is ok):
> http://www.open-mpi.org/community/help/
>
> -- Josh
>
> On Apr 28, 2008, at 5:10 AM, Leonardo Fialho wrote:
>
>
>> Changing some parameters (blcr_checkpoint_cmd):
>>
>> [aogrd01:08552] crs:blcr: checkpoint(8552, ---)
>> [aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
>> [aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
>> [aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
>> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec
>> :(/softs/blcr-0.6.5/bin/cr_checkpoint,
>> /softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
>> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
>> [aogrd01:08552] crs:blcr: thread_callback()
>> [aogrd01:08552] crs:blcr: thread_callback: Continue.
>> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with
>> status 2
>> Checkpoint failed: Bad file descriptor
>> chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552':
>> No such file or directory
>> [aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command
>> <chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :
>> [256].
>> crs:blcr chmod: Resource temporarily unavailable
>> [aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the
>> checkpoint file (ompi_blcr_context.8552 in the directory
>> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
>> crs:blcr: checkpoint: Invalid argument
>> [aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
>>
>> BLCR doesn't generate the context file
>> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
>> checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid
>> 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it
>> returns the same error: Checkpoint failed: Bad file descriptor
>>
>> Thanks,
>> Leonardo Fialho
>>
>> Leonardo Fialho wrote:
>>
>>> Hi All,
>>>
>>> Has anybody experienced this error?
>>>
>>> [aogrdini:09070] Global) Receive a command message from [[13242,0],0].
>>> ...
>>> [aogrd02:07642] Local) Receive a command message.
>>> ...
>>> [aogrd01:07938] Local) Receive a command message.
>>> ...
>>> [aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
>>> ...
>>> [aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
>>> ...
>>> [aogrd01:07941] crs:blcr: checkpoint(7941, ---)
>>> [aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
>>> [aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
>>> [aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
>>> cr_checkpoint --pid 7941 --file
>>> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to
>>> execute :(-1):
>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>>> ...
>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
>>> cr_checkpoint --pid 7645 --file
>>> /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to
>>> execute :(-1):
>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>>> ...
>>> [aogrd02:07642] Local) Location: [/tmp/opal_snapshot_1.ckpt]
>>>
>>> The application stops here and doesn't continue execution. It's
>>> using libcr version 0.6.5
>>> $ lsof -p 7518
>>> /softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
>>>
>>> After the orte-checkpoint command, the application process is
>>> duplicated on the nodes, as a child of the original process.
>>> When I run an application with this version and take a checkpoint
>>> manually, I have no problem...
>>>
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
>>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478