Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to restart a job twice
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-22 20:52:44


Tamer,

This should now be fixed in r18241.

Though I was able to replicate this bug, it only occurred
sporadically for me. It seemed to be caused by some socket descriptor
caching that was not properly cleaned up by the restart procedure.

My testing suggests that this bug is now fixed, but since it is
difficult to reproduce, please let me know if you see it happen
again.

With the current trunk you may see the following error message:
--------------------------------------
[odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------
This is not caused by the checkpoint/restart code, but by some recent
changes to our TCP component. We are working on fixing this, but I
just wanted to give you a heads up in case you see this error. As far
as I can tell it does not interfere with the checkpoint/restart
functionality.

Let me know if this fixes your problem.

Cheers,
Josh

On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:

> Tamer,
>
> Just wanted to update you on my progress. I am able to reproduce
> something similar to this problem. I am currently working on a
> solution to it. I'll let you know when it is available, probably in
> the next day or two.
>
> Thank you for the bug report.
>
> Cheers,
> Josh
>
> On Apr 18, 2008, at 1:11 PM, Tamer wrote:
>
>> Hi Josh:
>>
>> I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7
>>
>> The machine is dual-core with shared memory so it's not even a
>> cluster.
>>
>> I downloaded r18208 and built it with the following options:
>>
>> ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 \
>>     --with-ft=cr --with-blcr=/usr/local/blcr
>>
>> when I run mpirun I pass the following command:
>>
>> mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760
>>
>> I was able to checkpoint and restart successfully and was able to
>> checkpoint the restarted job (mpirun showed up with
>> ps -efa | grep mpirun under r18208) but was unable to restart again;
>> here's the error message:
>>
>> ompi-restart ompi_global_snapshot_23865.ckpt
>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
>> Connection to lifeline [[45670,0],0] lost
>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
>> Connection to lifeline [[45670,0],0] lost
>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>> [dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:
>> Connection to lifeline [[45670,0],0] lost
>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>> [dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:
>> Connection to lifeline [[45670,0],0] lost
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 1 with PID 24012 on
>> node dhcp-119-202.caltech.edu exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>>
>> Thank you in advance for your help.
>>
>> Tamer
>>
>>
>> On Apr 18, 2008, at 7:07 AM, Josh Hursey wrote:
>>
>>> This problem has come up in the past and may have been fixed since
>>> r14519. Can you update to r18208 and see if the error still occurs?
>>>
>>> A few other questions that will help me try to reproduce the problem.
>>> Can you tell me more about the configuration of the system you are
>>> running on (number of machines, if there is a resource manager)? How
>>> did you configure Open MPI and what command line options are you
>>> passing to 'mpirun'?
>>>
>>> -- Josh
>>>
>>> On Apr 18, 2008, at 9:36 AM, Tamer wrote:
>>>
>>>> Thanks Josh, I tried what you suggested with my existing r14519, and
>>>> I was able to checkpoint the restarted job but was never able to
>>>> restart it. I looked up the PID for 'orterun' and checkpointed the
>>>> restarted job, but when I try to restart from that point I get the
>>>> following error:
>>>>
>>>> ompi-restart ompi_global_snapshot_7704.ckpt
>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
>>>> Connection to lifeline [[61851,0],0] lost
>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
>>>> Connection to lifeline [[61851,0],0] lost
>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
>>>> Connection to lifeline [[61851,0],0] lost
>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
>>>> [dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
>>>> Connection to lifeline [[61851,0],0] lost
>>>> --------------------------------------------------------------------------
>>>> orterun has exited due to process rank 1 with PID 7737 on
>>>> node dhcp-119-202.caltech.edu exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by orterun (as reported here).
>>>>
>>>> Do I have to run the copenmpi clean command after the first
>>>> checkpoint and before restarting the checkpointed job so I can
>>>> checkpoint it again, or is there something I am missing in this
>>>> version completely and I would have to go to r18208? Thank you in
>>>> advance for your help.
>>>>
>>>> Tamer
>>>>
>>>> On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:
>>>>
>>>>> When you use 'ompi-restart' to restart a job, it fork/execs a
>>>>> completely new job using the restarted processes for the ranks.
>>>>> However, instead of calling the 'mpirun' process, ompi-restart
>>>>> currently calls 'orterun'. These two programs are exactly the same
>>>>> (mpirun is a symbolic link to orterun). So if you look for the PID
>>>>> of 'orterun', that can be used to checkpoint the process.
>>>>>
>>>>> However, it is confusing that Open MPI makes this switch. So I
>>>>> committed in r18208 a fix for this that uses the 'mpirun' binary
>>>>> name instead of the 'orterun' binary name. This fits with the
>>>>> typical use case of checkpoint/restart in Open MPI, in which users
>>>>> expect to find the 'mpirun' process on restart instead of the
>>>>> lesser-known 'orterun' process.
>>>>>
>>>>> Sorry for the confusion.
>>>>>
>>>>> Josh
>>>>>
>>>>> On Apr 18, 2008, at 1:14 AM, Tamer wrote:
>>>>>
>>>>>> Dear all, I installed the developer's version r14519 and was able
>>>>>> to get it running. I successfully checkpointed a parallel job and
>>>>>> restarted it. My question is: how can I checkpoint the restarted
>>>>>> job? The problem is that once the original job is terminated and
>>>>>> restarted later on, the mpirun process no longer exists
>>>>>> (ps -efa | grep mpirun), and hence I do not know which PID I should
>>>>>> use when I run ompi-checkpoint on the restarted job. Any help
>>>>>> would be greatly appreciated.
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
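
The PID lookup discussed in this thread (finding the launcher with
ps -efa | grep mpirun, or under the name orterun on builds before r18208)
can be sketched as a small shell helper. This is only an illustration, not
part of Open MPI: the function name find_launcher_pid is hypothetical, and
the usage line assumes a ps that accepts -eo pid=,comm= (as procps on Linux
does). It reports the last matching process in the listing, which is only a
sensible choice when a single job is running.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): read "PID COMMAND" lines on
# stdin and print the PID of the last process whose command name is exactly
# mpirun or orterun. Exits with status 1 if no such process is listed.
find_launcher_pid() {
    awk '
        $2 == "mpirun" || $2 == "orterun" { pid = $1 }
        END {
            if (pid == "") exit 1   # no launcher found
            print pid
        }
    '
}

# Usage sketch, commented out because it needs a running Open MPI job:
#   pid=$(ps -eo pid=,comm= | find_launcher_pid) && ompi-checkpoint "$pid"
```

Matching the command name exactly avoids the usual problem with
ps | grep, where the grep process itself appears in the results.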