Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to restart a job twice
From: Tamer (tamer_at_[hidden])
Date: 2008-04-24 10:50:59


Josh, thank you for your help. I was able to do the following with
r18241:

start the parallel job
checkpoint and restart
checkpoint and restart
checkpoint but failed to restart with the following message:

ompi-restart ompi_global_snapshot_23800.ckpt
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
to lifeline [[45699,0],0] lost
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
to lifeline [[45699,0],0] lost
[dhcp-119-202:23650] *** Process received signal ***
[dhcp-119-202:23650] Signal: Segmentation fault (11)
[dhcp-119-202:23650] Signal code: Address not mapped (1)
[dhcp-119-202:23650] Failing at address: 0x3e0f50
[dhcp-119-202:23650] [ 0] [0x110440]
[dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107)
[0xc5df97]
[dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
[dhcp-119-202:23650] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 23857 on node
dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).

So this time the process went further than before. I also tested on a
different platform (a 64-bit machine running Fedora Core 7), and there
Open MPI checkpoints and restarts as many times as I want without any
problems. This means the issue above must be platform dependent, and
that I am probably missing some option when building the code.
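
For reference, each checkpoint/restart cycle above was roughly the
following (a sketch; the application arguments are the ones I use, and
the snapshot name is whatever ompi-checkpoint reports, e.g.
ompi_global_snapshot_23800.ckpt for the last cycle):

  mpirun -np 2 -am ft-enable-cr ./ares-openmpi-r18241 -c -f madonna-13760
  ps -efa | grep mpirun              # note the PID of mpirun
  ompi-checkpoint <PID of mpirun>    # reports the global snapshot name
  ompi-restart ompi_global_snapshot_23800.ckpt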

Cheers,
Tamer

On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:

> Tamer,
>
> This should now be fixed in r18241.
>
> Though I was able to replicate this bug, it only occurred
> sporadically for me. It seemed to be caused by some socket descriptor
> caching that was not properly cleaned up by the restart procedure.
>
> My testing indicates that this bug is now fixed, but since it is
> difficult to reproduce, please let me know if you see it happen again.
>
>
> With the current trunk you may see the following error message:
> --------------------------------------
> [odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------
> This is not caused by the checkpoint/restart code, but by some recent
> changes to our TCP component. We are working on fixing this, but I
> just wanted to give you a heads up in case you see this error. As far
> as I can tell it does not interfere with the checkpoint/restart
> functionality.
>
> Let me know if this fixes your problem.
>
> Cheers,
> Josh
>
>
> On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:
>
>> Tamer,
>>
>> Just wanted to update you on my progress. I am able to reproduce
>> something similar to this problem. I am currently working on a
>> solution to it. I'll let you know when it is available, probably in
>> the next day or two.
>>
>> Thank you for the bug report.
>>
>> Cheers,
>> Josh
>>
>> On Apr 18, 2008, at 1:11 PM, Tamer wrote:
>>
>>> Hi Josh:
>>>
>>> I am running on Linux (Fedora Core 7, kernel 2.6.23.15-80.fc7).
>>>
>>> The machine is dual-core with shared memory, so it's not even a
>>> cluster.
>>>
>>> I downloaded r18208 and built it with the following options:
>>>
>>> ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --
>>> with-ft=cr --with-blcr=/usr/local/blcr
>>>
>>> When I run mpirun, I use the following command:
>>>
>>> mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760
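>>>
>>> To take a checkpoint I then look up mpirun's PID and run roughly the
>>> following (the PID is illustrative):
>>>
>>> ps -efa | grep mpirun
>>> ompi-checkpoint <PID of mpirun>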
>>>
>>> I was able to checkpoint and restart successfully, and I was able
>>> to checkpoint the restarted job (mpirun showed up in 'ps -efa | grep
>>> mpirun' under r18208), but I was unable to restart again; here's the
>>> error message:
>>>
>>> ompi-restart ompi_global_snapshot_23865.ckpt
>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
>>> Connection to lifeline [[45670,0],0] lost
>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
>>> Connection to lifeline [[45670,0],0] lost
>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
>>> Connection to lifeline [[45670,0],0] lost
>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
>>> Connection to lifeline [[45670,0],0] lost
>>> ---------------------------------------------------------------------
>>> -----
>>> mpirun has exited due to process rank 1 with PID 24012 on
>>> node dhcp-119-202.caltech.edu exiting without calling "finalize".
>>> This may
>>> have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here).
>>>
>>> Thank you in advance for your help.
>>>
>>> Tamer
>>>
>>>
>>> On Apr 18, 2008, at 7:07 AM, Josh Hursey wrote:
>>>
>>>> This problem has come up in the past and may have been fixed since
>>>> r14519. Can you update to r18208 and see if the error still occurs?
>>>>
>>>> A few other questions will help me try to reproduce the problem:
>>>> can you tell me more about the configuration of the system you are
>>>> running on (number of machines, whether there is a resource
>>>> manager)? How did you configure Open MPI, and what command line
>>>> options are you passing to 'mpirun'?
>>>>
>>>> -- Josh
>>>>
>>>> On Apr 18, 2008, at 9:36 AM, Tamer wrote:
>>>>
>>>>> Thanks Josh, I tried what you suggested with my existing r14519,
>>>>> and I was able to checkpoint the restarted job but was never able
>>>>> to restart it. I looked up the PID for 'orterun' and checkpointed
>>>>> the restarted job, but when I try to restart from that point I get
>>>>> the following error:
>>>>>
>>>>> ompi-restart ompi_global_snapshot_7704.ckpt
>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
>>>>> Connection to lifeline [[61851,0],0] lost
>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
>>>>> Connection to lifeline [[61851,0],0] lost
>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
>>>>> Connection to lifeline [[61851,0],0] lost
>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
>>>>> Connection to lifeline [[61851,0],0] lost
>>>>> -------------------------------------------------------------------
>>>>> -------
>>>>> orterun has exited due to process rank 1 with PID 7737 on
>>>>> node dhcp-119-202.caltech.edu exiting without calling "finalize".
>>>>> This
>>>>> may
>>>>> have caused other processes in the application to be
>>>>> terminated by signals sent by orterun (as reported here).
>>>>>
>>>>> Do I have to run the ompi-clean command after the first checkpoint
>>>>> and before restarting the checkpointed job so I can checkpoint it
>>>>> again, or is there something I am missing in this version entirely
>>>>> and I would have to go to r18208? Thank you in advance for your
>>>>> help.
>>>>>
>>>>> Tamer
>>>>>
>>>>> On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:
>>>>>
>>>>>> When you use 'ompi-restart' to restart a job, it fork/execs a
>>>>>> completely new job using the restarted processes for the ranks.
>>>>>> However, instead of launching the 'mpirun' process, ompi-restart
>>>>>> currently launches 'orterun'. These two programs are exactly the
>>>>>> same (mpirun is a symbolic link to orterun), so if you look for
>>>>>> the PID of 'orterun', that is the one to use to checkpoint the
>>>>>> job.
>>>>>>
>>>>>> However, it is confusing that Open MPI makes this switch, so in
>>>>>> r18208 I committed a fix that uses the 'mpirun' binary name
>>>>>> instead of the 'orterun' binary name. This fits the typical
>>>>>> checkpoint/restart use case in Open MPI, in which users expect to
>>>>>> find the 'mpirun' process on restart rather than the lesser-known
>>>>>> 'orterun' process.
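>>>>>>
>>>>>> For example, on a pre-r18208 build a restarted job can be
>>>>>> checkpointed again roughly like this (the snapshot name and PID
>>>>>> are illustrative):
>>>>>>
>>>>>> ompi-restart ompi_global_snapshot_1234.ckpt
>>>>>>
>>>>>> # then, from another terminal:
>>>>>> ps -efa | grep orterun            # the restarted job's launcher
>>>>>> ompi-checkpoint <PID of orterun>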
>>>>>>
>>>>>> Sorry for the confusion.
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Apr 18, 2008, at 1:14 AM, Tamer wrote:
>>>>>>
>>>>>>> Dear all, I installed the developer's version r14519 and was
>>>>>>> able to get it running. I successfully checkpointed a parallel
>>>>>>> job and restarted it. My question is: how can I checkpoint the
>>>>>>> restarted job? The problem is that once the original job is
>>>>>>> terminated and later restarted, the mpirun process no longer
>>>>>>> exists (per 'ps -efa | grep mpirun'), and hence I do not know
>>>>>>> which PID I should use when I run ompi-checkpoint on the
>>>>>>> restarted job. Any help would be greatly appreciated.
>>>>>>>