Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-30 09:24:50


On Apr 29, 2008, at 7:18 AM, Leonardo Fialho wrote:

> Josh,
>
> Last night I made some changes, checked out a new SVN version, and
> completely revised the BLCR installation. It's working fine now. I
> suspect two different things:
>
> 1) cached or old files (configured with the older BLCR version path) in
> autom4te, configure, or the dependencies;
> 2) some misconfiguration in the BLCR header files.
>
> When I checkpoint/restart a non-MPI application, it probably uses
> the correct libraries, but the BLCR module was probably compiled
> with older headers (cache?).
>
> I'm trying to reproduce the error, but before these changes (when
> it wasn't working) BLCR returned the "bad file descriptor" (EBADF)
> error, and the blcr module didn't catch this error; it only returned
> (-1) "child failed".

I'll take a look at this and try to have the Open MPI BLCR module
return something more representative of the actual error message.
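
As an illustration of what "more representative" could look like, here is a minimal Python sketch of a fork/exec wrapper whose child reports the errno from a failed exec back to the parent through a pipe, instead of collapsing every failure into -1. This is not the Open MPI BLCR module's actual C code; the function name run_and_report and the exit code 127 are assumptions made for this example.

```python
import os


def run_and_report(argv):
    """Fork, exec argv in the child, and return (exit_code, exec_errno).

    exec_errno is 0 when the exec itself succeeded; otherwise it is the
    errno raised by os.execvp in the child (for example ENOENT).
    exit_code is -1 if the child did not exit normally (e.g. a signal).
    """
    # os.pipe() returns non-inheritable descriptors, so the write end is
    # closed automatically when the exec succeeds and the parent sees EOF.
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: try to replace this process image with the command.
        os.close(read_fd)
        try:
            os.execvp(argv[0], argv)
        except OSError as exc:
            # The exec failed: send the real errno to the parent.
            os.write(write_fd, str(exc.errno).encode())
        finally:
            # Only reached when the exec failed. Exit immediately so the
            # child does not keep running as a copy of the parent.
            os._exit(127)

    # Parent: read whatever the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    _, status = os.waitpid(pid, 0)
    exec_errno = int(data) if data else 0
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_code, exec_errno


if __name__ == "__main__":
    # Command line taken from the log in this thread.
    code, err = run_and_report(
        ["cr_checkpoint", "--pid", "8552", "--file",
         "/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552"])
    if err:
        print("exec failed:", os.strerror(err))
    else:
        print("command exited with status", code)
```

Note that this only distinguishes exec failures. An error raised inside the command after a successful exec (such as the "Bad file descriptor" message printed by cr_checkpoint) still reaches the parent only as an exit status.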

-- Josh

>
>
> Thanks,
> Leonardo Fialho
>
> Josh Hursey wrote:
>> I don't think I have ever seen this one before. :(
>>
>> So you are trying to checkpoint the MPI process by hand or a non-MPI
>> process? Can you confirm that you can successfully checkpoint/restart
>> a non-MPI process on these machines? What version of the Open MPI
>> trunk are you using? Have you made any changes to the trunk to
>> produce
>> this build?
>>
>> Can you send me the info described here (off-list is ok):
>> http://www.open-mpi.org/community/help/
>>
>> -- Josh
>>
>> On Apr 28, 2008, at 5:10 AM, Leonardo Fialho wrote:
>>
>>
>>> Changing some parameters (blcr_checkpoint_cmd):
>>>
>>> [aogrd01:08552] crs:blcr: checkpoint(8552, ---)
>>> [aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
>>> [aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
>>> [aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
>>> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec
>>> :(/softs/blcr-0.6.5/bin/cr_checkpoint,
>>> /softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
>>> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
>>> [aogrd01:08552] crs:blcr: thread_callback()
>>> [aogrd01:08552] crs:blcr: thread_callback: Continue.
>>> [aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with
>>> status 2
>>> Checkpoint failed: Bad file descriptor
>>> chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.
>>> 8552':
>>> No such file or directory
>>> [aogrd01:08552] crs:blcr: move(): Error: Unable to execute the
>>> command
>>> <chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :
>>> [256].
>>> crs:blcr chmod: Resource temporarily unavailable
>>> [aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the
>>> checkpoint file (ompi_blcr_context.8552 in the directory
>>> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
>>> crs:blcr: checkpoint: Invalid argument
>>> [aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
>>>
>>> BLCR doesn't generate the context file
>>> (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
>>> checkpoint command manually
>>> (/softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
>>> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it returns
>>> the same error: Checkpoint failed: Bad file descriptor
>>>
>>> Thanks,
>>> Leonardo Fialho
>>>
>>> Leonardo Fialho wrote:
>>>
>>>> Hi All,
>>>>
>>>> Has anybody experienced this error?
>>>>
>>>> [aogrdini:09070] Global) Receive a command message from [[13242,0],
>>>> 0].
>>>> ...
>>>> [aogrd02:07642] Local) Receive a command message.
>>>> ...
>>>> [aogrd01:07938] Local) Receive a command message.
>>>> ...
>>>> [aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
>>>> ...
>>>> [aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
>>>> ...
>>>> [aogrd01:07941] crs:blcr: checkpoint(7941, ---)
>>>> [aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
>>>> [aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
>>>> [aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
>>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :
>>>> (cr_checkpoint,
>>>> cr_checkpoint --pid 7941 --file
>>>> /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
>>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to
>>>> execute :(-1):
>>>> [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>>>> ...
>>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :
>>>> (cr_checkpoint,
>>>> cr_checkpoint --pid 7645 --file
>>>> /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
>>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to
>>>> execute :(-1):
>>>> [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
>>>> ...
>>>> [aogrd02:07642] Local) Location: [/tmp/
>>>> opal_snapshot_1.ckpt]
>>>>
>>>> The application stops here and doesn't continue execution. It's
>>>> using libcr version 0.6.5:
>>>> $ lsof -p 7518
>>>> /softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
>>>>
>>>> After the orte-checkpoint command, the application process is
>>>> duplicated on the nodes, like a child of the original process.
>>>> When I run an application with this version and take a checkpoint
>>>> manually, I have no problem...
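
The log above shows the module exec'ing the bare name `cr_checkpoint`, and execvp resolves bare names through the PATH of the calling process. Whether PATH was the cause here is not confirmed in this thread, but it is a cheap thing to rule out. The following Python sketch checks whether a command name resolves on a given search path; the helper name resolve_command is hypothetical.

```python
import os
import shutil


def resolve_command(name, path=None):
    """Return the absolute path that a PATH lookup finds for name, or None.

    path is an os.pathsep-separated search path; when it is None the
    PATH of the current process is used.
    """
    found = shutil.which(name, path=path)
    return os.path.abspath(found) if found else None


if __name__ == "__main__":
    location = resolve_command("cr_checkpoint")
    if location is None:
        print("cr_checkpoint is not on PATH; an exec of the bare name would fail")
    else:
        print("cr_checkpoint resolves to", location)
```

Running this in the same environment as the MPI processes (for example through mpirun) matters, since their PATH can differ from that of an interactive shell.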
>>>>
>>>> Leonardo Fialho
>>>> Computer Architecture and Operating Systems Department - CAOS
>>>> Universidad Autonoma de Barcelona - UAB
>>>> ETSE, Edifcio Q, QC/3088
>>>> http://www.caos
>>>> Phone: +34-93-581-2888
>>>> Fax: +34-93-581-2478
>>>>
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos.uab.es
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>>
>
>