
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-30 09:24:50


On Apr 29, 2008, at 7:18 AM, Leonardo Fialho wrote:

> Josh,
>
> Last night I made some changes, checked out a fresh SVN version, and
> completely redid the BLCR installation. It's working fine now. I
> suspect two different things:
>
> 1) cached or old files (configured with the older BLCR version's path)
> in autom4te, configure, or the dependencies;
> 2) some misconfiguration in the BLCR header files.
>
> When I checkpoint/restart a non-MPI application, that application
> probably uses the correct libraries, but the BLCR module was probably
> compiled against the older headers (cache?).
>
> I'm trying to reproduce the error, but before these changes (when it
> was not working) BLCR returned the "bad file descriptor" (EBADF)
> error, and the blcr module didn't catch this error; it only returned
> (-1) "child failed".

I'll take a look at this and try to have the Open MPI BLCR module
return something more representative of the actual error message.
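
As a rough illustration only (this is not the actual Open MPI BLCR
module code), one common way to surface the real reason is to have the
child pass errno back to the parent over a close-on-exec pipe when
execvp() fails. The helper name run_and_report and its return
convention are assumptions made for this sketch. It separates "the
command could not be exec'd" from "the command ran but failed":

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Open MPI or BLCR): run argv[0] with
 * execvp() in a child process.
 * Returns  0 if the command ran and exited with status 0,
 *          a positive errno value if pipe/fork/execvp itself failed,
 *         -1 if the command ran but exited non-zero or was killed.
 */
int run_and_report(char *const argv[])
{
    int errpipe[2];
    int saved;

    if (pipe(errpipe) != 0) {
        return errno;
    }
    /* Close-on-exec: the write end disappears if execvp() succeeds,
     * so the parent then reads EOF instead of an errno value. */
    if (fcntl(errpipe[1], F_SETFD, FD_CLOEXEC) != 0) {
        saved = errno;
        close(errpipe[0]);
        close(errpipe[1]);
        return saved;
    }

    pid_t pid = fork();
    if (pid < 0) {
        saved = errno;
        close(errpipe[0]);
        close(errpipe[1]);
        return saved;
    }

    if (pid == 0) {
        /* Child: only reaches the lines after execvp() on failure. */
        close(errpipe[0]);
        execvp(argv[0], argv);
        saved = errno;
        if (write(errpipe[1], &saved, sizeof saved) < 0) {
            /* Nothing more the child can do; fall through to _exit. */
        }
        _exit(127);
    }

    /* Parent: read the child's errno, if it sent one. */
    close(errpipe[1]);
    int child_errno = 0;
    ssize_t n;
    do {
        n = read(errpipe[0], &child_errno, sizeof child_errno);
    } while (n < 0 && errno == EINTR);
    close(errpipe[0]);

    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
        /* retry if interrupted by a signal */
    }

    if (n == (ssize_t) sizeof child_errno) {
        fprintf(stderr, "exec of %s failed: %s\n",
                argv[0], strerror(child_errno));
        return child_errno;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    fprintf(stderr, "%s ran but did not succeed (wait status %d)\n",
            argv[0], status);
    return -1;
}
```

With a helper along these lines, a cr_checkpoint binary that is missing
from the PATH would be reported as "No such file or directory" rather
than a bare -1, while a cr_checkpoint that starts and then fails would
take the -1 branch and leave its own message on stderr.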

-- Josh

>
>
> Thanks,
> Leonardo Fialho
>
> Josh Hursey wrote:
>> I don't think I have ever seen this one before. :(
>>
>> So you are trying to checkpoint the MPI process by hand or a non-MPI
>> process? Can you confirm that you can successfully checkpoint/restart
>> a non-MPI process on these machines? What version of the Open MPI
>> trunk are you using? Have you made any changes to the trunk to
>> produce this build?
>>
>> Can you send me the info described here (off-list is ok):
>> http://www.open-mpi.org/community/help/
>>
>> -- Josh
>>
>> On Apr 28, 2008, at 5:10 AM, Leonardo Fialho wrote:
>>
>>
>>> Changing some parameters (blcr_checkpoint_cmd):
>>>
>>> [aogrd01:08552] crs:blcr: checkpoint(8552, ---)
>>> [aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
>>> [aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
>>> [aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
>>> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec :(/softs/blcr-0.6.5/bin/cr_checkpoint, /softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
>>> [aogrd01:08552] crs:blcr: thread_callback()
>>> [aogrd01:08552] crs:blcr: thread_callback: Continue.
>>> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with status 2
>>> Checkpoint failed: Bad file descriptor
>>> chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552': No such file or directory
>>> [aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command <chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> : [256].
>>> crs:blcr chmod: Resource temporarily unavailable
>>> [aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the checkpoint file (ompi_blcr_context.8552 in the directory (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
>>> crs:blcr: checkpoint: Invalid argument
>>> [aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
>>>
>>> BLCR doesn't generate the context file
>>> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
>>> checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint
>>> --pid 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552),
>>> it returns the same error: Checkpoint failed: Bad file descriptor
>>>
>>> Thanks,
>>> Leonardo Fialho
>>>
>>> Leonardo Fialho wrote:
>>>
>>>> Hi All,
>>>>
>>>> Has anybody experienced this error?
>>>>
>>>> [aogrdini:09070] Global) Receive a command message from [[13242,0],0].
>>>> ...
>>>> [aogrd02:07642] Local) Receive a command message.
>>>> ...
>>>> [aogrd01:07938] Local) Receive a command message.
>>>> ...
>>>> [aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
>>>> ...
>>>> [aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
>>>> ...
>>>> [aogrd01:07941] crs:blcr: checkpoint(7941, ---)
>>>> [aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
>>>> [aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
>>>> [aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
>>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7941 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
>>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
>>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>>>> ...
>>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7645 --file /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
>>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
>>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>>>> ...
>>>> [aogrd02:07642] Local) Location: [/tmp/opal_snapshot_1.ckpt]
>>>>
>>>> The application stops here and doesn't continue executing. It's
>>>> using libcr version 0.6.5:
>>>> $ lsof -p 7518
>>>> /softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
>>>>
>>>> After the orte-checkpoint command, the application process is
>>>> duplicated on the nodes, as a child of the original process.
>>>> When I run an application with this version and take a checkpoint
>>>> manually, I have no problem...
>>>>
>>>> Leonardo Fialho
>>>> Computer Architecture and Operating Systems Department - CAOS
>>>> Universidad Autonoma de Barcelona - UAB
>>>> ETSE, Edifcio Q, QC/3088
>>>> http://www.caos
>>>> Phone: +34-93-581-2888
>>>> Fax: +34-93-581-2478
>>>>
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos.uab.es
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>
>
> --
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edifcio Q, QC/3088
> http://www.caos.uab.es
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478