Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to restart a job twice
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-24 11:02:19


Tamer,

Another user contacted me off-list yesterday with a similar problem
on the current trunk. I have been able to reproduce it and am
currently debugging it. The problem seems to occur more often in
builds without the checkpoint thread (--disable-ft-thread), and it
appears to be a race in our connection wireup, which is why it does
not always show up.
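
As a possible workaround while I track this down: since the problem
shows up more often without the checkpoint thread, a build with the
thread explicitly enabled may be less likely to hit it. A sketch,
reusing the configure line from your earlier mail and assuming your
trunk checkout accepts --enable-ft-thread (the counterpart of the
--disable-ft-thread flag mentioned above):

  ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 \
      --with-ft=cr --with-blcr=/usr/local/blcr --enable-ft-thread
  make && make install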

Thank you for your patience as I try to track this down. I'll let you
know as soon as I have a fix.

Cheers,
Josh

On Apr 24, 2008, at 10:50 AM, Tamer wrote:

> Josh, thank you for your help. I was able to do the following with
> r18241:
>
> start the parallel job
> checkpoint and restart
> checkpoint and restart
> checkpoint but failed to restart with the following message:
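>
> For reference, the sequence above maps onto commands roughly like
> the following (illustrative only, reusing the application arguments
> from my earlier mail; PIDs and snapshot handles differ per run, and
> the --term flag asks ompi-checkpoint to terminate the job once the
> checkpoint is taken):
>
> mpirun -np 2 -am ft-enable-cr ./ares-openmpi-r18241 -c -f madonna-13760
> ompi-checkpoint --term <PID of mpirun>
> ompi-restart ompi_global_snapshot_<PID>.ckpt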
>
> ompi-restart ompi_global_snapshot_23800.ckpt
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
> to lifeline [[45699,0],0] lost
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
> [dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection
> to lifeline [[45699,0],0] lost
> [dhcp-119-202:23650] *** Process received signal ***
> [dhcp-119-202:23650] Signal: Segmentation fault (11)
> [dhcp-119-202:23650] Signal code: Address not mapped (1)
> [dhcp-119-202:23650] Failing at address: 0x3e0f50
> [dhcp-119-202:23650] [ 0] [0x110440]
> [dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107)
> [0xc5df97]
> [dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
> [dhcp-119-202:23650] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 23857 on node
> dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).
>
>
> So this time the process went further than before. I tested on a
> different platform (a 64-bit machine with Fedora Core 7), and there
> Open MPI checkpoints and restarts as many times as I want without
> any problems. This suggests that the issue above is
> platform-dependent and that I am missing some option when building
> the code.
>
> Cheers,
> Tamer
>
>
> On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:
>
>> Tamer,
>>
>> This should now be fixed in r18241.
>>
>> Though I was able to replicate this bug, it only occurred
>> sporadically for me. It seemed to be caused by some socket descriptor
>> caching that was not properly cleaned up by the restart procedure.
>>
>> My testing indicates that this bug is now fixed, but since it is
>> difficult to reproduce, definitely let me know if you see it happen
>> again.
>>
>>
>> With the current trunk you may see the following error message:
>> --------------------------------------
>> [odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> --------------------------------------
>> This is not caused by the checkpoint/restart code, but by some recent
>> changes to our TCP component. We are working on fixing this, but I
>> just wanted to give you a heads up in case you see this error. As far
>> as I can tell it does not interfere with the checkpoint/restart
>> functionality.
>>
>> Let me know if this fixes your problem.
>>
>> Cheers,
>> Josh
>>
>>
>> On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:
>>
>>> Tamer,
>>>
>>> Just wanted to update you on my progress. I am able to reproduce
>>> something similar to this problem. I am currently working on a
>>> solution to it. I'll let you know when it is available, probably in
>>> the next day or two.
>>>
>>> Thank you for the bug report.
>>>
>>> Cheers,
>>> Josh
>>>
>>> On Apr 18, 2008, at 1:11 PM, Tamer wrote:
>>>
>>>> Hi Josh:
>>>>
>>>> I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7.
>>>>
>>>> The machine is a dual-core, shared-memory box, so it is not even a
>>>> cluster.
>>>>
>>>> I downloaded r18208 and built it with the following options:
>>>>
>>>> ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 \
>>>>     --with-ft=cr --with-blcr=/usr/local/blcr
>>>>
>>>> When I run the job, I use the following mpirun command:
>>>>
>>>> mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760
>>>>
>>>> I was able to checkpoint and restart successfully, and I was able
>>>> to checkpoint the restarted job (mpirun showed up in 'ps -efa |
>>>> grep mpirun' under r18208), but I was unable to restart again;
>>>> here is the error message:
>>>>
>>>> ompi-restart ompi_global_snapshot_23865.ckpt
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
>>>> Connection to lifeline [[45670,0],0] lost
>>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
>>>> Connection to lifeline [[45670,0],0] lost
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
>>>> Connection to lifeline [[45670,0],0] lost
>>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
>>>> Connection to lifeline [[45670,0],0] lost
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 1 with PID 24012 on
>>>> node dhcp-119-202.caltech.edu exiting without calling "finalize".
>>>> This may have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>>
>>>> Thank you in advance for your help.
>>>>
>>>> Tamer
>>>>
>>>>
>>>> On Apr 18, 2008, at 7:07 AM, Josh Hursey wrote:
>>>>
>>>>> This problem has come up in the past and may have been fixed since
>>>>> r14519. Can you update to r18208 and see if the error still
>>>>> occurs?
>>>>>
>>>>> A few other questions will help me try to reproduce the problem:
>>>>> Can you tell me more about the configuration of the system you are
>>>>> running on (number of machines, whether there is a resource
>>>>> manager)? How did you configure Open MPI, and what command-line
>>>>> options are you passing to 'mpirun'?
>>>>>
>>>>> -- Josh
>>>>>
>>>>> On Apr 18, 2008, at 9:36 AM, Tamer wrote:
>>>>>
>>>>>> Thanks Josh, I tried what you suggested with my existing r14519,
>>>>>> and I was able to checkpoint the restarted job but was never able
>>>>>> to restart it. I looked up the PID for 'orterun' and checkpointed
>>>>>> the restarted job, but when I tried to restart from that point I
>>>>>> got the following error:
>>>>>>
>>>>>> ompi-restart ompi_global_snapshot_7704.ckpt
>>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
>>>>>> Connection to lifeline [[61851,0],0] lost
>>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
>>>>>> Connection to lifeline [[61851,0],0] lost
>>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
>>>>>> Connection to lifeline [[61851,0],0] lost
>>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
>>>>>> Connection to lifeline [[61851,0],0] lost
>>>>>> --------------------------------------------------------------------------
>>>>>> orterun has exited due to process rank 1 with PID 7737 on
>>>>>> node dhcp-119-202.caltech.edu exiting without calling "finalize".
>>>>>> This may have caused other processes in the application to be
>>>>>> terminated by signals sent by orterun (as reported here).
>>>>>>
>>>>>> Do I have to run the ompi-clean command after the first
>>>>>> checkpoint and before restarting the checkpointed job so that I
>>>>>> can checkpoint it again, or is there something I am missing in
>>>>>> this version entirely, so that I would have to go to r18208?
>>>>>> Thank you in advance for your help.
>>>>>>
>>>>>> Tamer
>>>>>>
>>>>>> On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:
>>>>>>
>>>>>>> When you use 'ompi-restart' to restart a job, it fork/execs a
>>>>>>> completely new job, using the restarted processes for the ranks.
>>>>>>> However, instead of launching an 'mpirun' process, ompi-restart
>>>>>>> currently launches 'orterun'. The two programs are exactly the
>>>>>>> same (mpirun is a symbolic link to orterun), so if you look up
>>>>>>> the PID of 'orterun', you can use that to checkpoint the job.
>>>>>>>
>>>>>>> However, it is confusing that Open MPI makes this switch, so in
>>>>>>> r18208 I committed a fix that uses the 'mpirun' binary name
>>>>>>> instead of the 'orterun' binary name. This fits the typical
>>>>>>> checkpoint/restart use case in Open MPI, in which users expect to
>>>>>>> find the 'mpirun' process on restart rather than the lesser-known
>>>>>>> 'orterun' process.
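>>>>>>>
>>>>>>> Concretely, on your current build something along these lines
>>>>>>> should let you checkpoint the restarted job (using ps as in your
>>>>>>> earlier mails; substitute the PID you find):
>>>>>>>
>>>>>>> ps -efa | grep orterun
>>>>>>> ompi-checkpoint <PID of orterun>
>>>>>>>
>>>>>>> After updating to r18208 or later, the same steps apply, but the
>>>>>>> launcher will show up as 'mpirun' instead.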
>>>>>>>
>>>>>>> Sorry for the confusion.
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> On Apr 18, 2008, at 1:14 AM, Tamer wrote:
>>>>>>>
>>>>>>>> Dear all, I installed the developer's version r14519 and was
>>>>>>>> able to get it running. I successfully checkpointed a parallel
>>>>>>>> job and restarted it. My question is: how can I checkpoint the
>>>>>>>> restarted job? The problem is that once the original job is
>>>>>>>> terminated and restarted later on, the mpirun process no longer
>>>>>>>> exists (ps -efa | grep mpirun), and hence I do not know which
>>>>>>>> PID I should use when I run ompi-checkpoint on the restarted
>>>>>>>> job. Any help would be greatly appreciated.