
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to restart a job twice
From: Tamer (tamer_at_[hidden])
Date: 2008-04-18 09:36:05


Thanks, Josh. I tried what you suggested with my existing r14519: I was
able to checkpoint the restarted job, but I was never able to restart
it again. I looked up the PID of 'orterun', checkpointed the restarted
job, and then, when I tried to restart from that point, got the
following error:

ompi-restart ompi_global_snapshot_7704.ckpt
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity: Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity: Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity: Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity: Connection to lifeline [[61851,0],0] lost
--------------------------------------------------------------------------
orterun has exited due to process rank 1 with PID 7737 on
node dhcp-119-202.caltech.edu exiting without calling "finalize".
This may have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).

Do I have to run the ompi-clean command after the first checkpoint and
before restarting the checkpointed job so that I can checkpoint it
again, or is something missing in this version entirely, so that I
would have to move to r18208? Thank you in advance for your help.
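
For reference, here is roughly the sequence I am running (the
application name, -np count, and PIDs below are just placeholders for
my actual values):

   $ mpirun -np 2 -am ft-enable-cr ./my_app       # original job, C/R enabled
   $ ompi-checkpoint 7704                         # PID of mpirun
   $ ompi-restart ompi_global_snapshot_7704.ckpt  # first restart: works
   $ ps -ef | grep orterun                        # find the restarted launcher
   $ ompi-checkpoint <orterun PID>                # this checkpoint succeeds
   $ ompi-restart ompi_global_snapshot_<PID>.ckpt # second restart: fails as above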

Tamer

On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:

> When you use 'ompi-restart' to restart a job, it fork/execs a
> completely new job, using the restarted processes as the ranks.
> However, instead of calling the 'mpirun' process, ompi-restart
> currently calls 'orterun'. These two programs are exactly the same
> (mpirun is a symbolic link to orterun), so the PID of 'orterun' is
> the one to use to checkpoint the process.
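>
> For example, roughly (the PID is whatever ps reports for the
> restarted job):
>
>    $ ps -ef | grep orterun          # find the restarted job's launcher
>    $ ompi-checkpoint <orterun PID>  # checkpoint it as usual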
>
> However, it is confusing that Open MPI makes this switch, so in
> r18208 I committed a fix that uses the 'mpirun' binary name instead
> of 'orterun'. This fits the typical checkpoint/restart use case in
> Open MPI, in which users expect to find the 'mpirun' process on
> restart rather than the lesser-known 'orterun' process.
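>
> You can confirm that the two are the same binary, e.g.:
>
>    $ ls -l `which mpirun`
>    ... mpirun -> orterun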
>
> Sorry for the confusion.
>
> Josh
>
> On Apr 18, 2008, at 1:14 AM, Tamer wrote:
>
>> Dear all, I installed the developer's version r14519 and was able to
>> get it running. I successfully checkpointed a parallel job and
>> restarted it. My question is: how can I checkpoint the restarted
>> job? The problem is that once the original job is terminated and
>> restarted later on, the mpirun process no longer exists
>> (ps -efa | grep mpirun finds nothing), so I do not know which PID to
>> use when I run ompi-checkpoint on the restarted job. Any help would
>> be greatly appreciated.