Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to restart a job twice
From: Tamer (tamer_at_[hidden])
Date: 2008-04-18 09:36:05


Thanks Josh. I tried what you suggested with my existing r14519 build.
I was able to checkpoint the restarted job, but I was never able to
restart from that second checkpoint. I looked up the PID of 'orterun'
and checkpointed the restarted job, but when I try to restart from that
point I get the following error:

ompi-restart ompi_global_snapshot_7704.ckpt
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
Connection to lifeline [[61851,0],0] lost
--------------------------------------------------------------------------
orterun has exited due to process rank 1 with PID 7737 on
node dhcp-119-202.caltech.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).

Do I have to run the Open MPI clean command after the first checkpoint
and before restarting the checkpointed job, so that I can checkpoint it
again? Or is something missing from this version entirely, so that I
would have to move to r18208? Thank you in advance for your help.

Tamer

On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:

> When you use 'ompi-restart' to restart a job, it fork/execs a
> completely new job, using the restarted processes for the ranks.
> However, instead of calling the 'mpirun' process, ompi-restart
> currently calls 'orterun'. These two programs are exactly the same
> (mpirun is a symbolic link to orterun). So if you look up the PID of
> 'orterun', you can use that PID to checkpoint the restarted job.
>
> However it is confusing that Open MPI makes this switch. So I
> committed in r18208 a fix for this that uses the 'mpirun' binary name
> instead of the 'orterun' binary name. This fits with the typical use
> case of checkpoint/restart in Open MPI in which users expect to find
> the 'mpirun' process on restart instead of the lesser known 'orterun'
> process.
>
> Sorry for the confusion.
>
> Josh
>
> On Apr 18, 2008, at 1:14 AM, Tamer wrote:
>
>> Dear all, I installed the developer's version r14519 and was able to
>> get it running. I successfully checkpointed a parallel job and
>> restarted it. My question is how can I checkpoint the restarted job?
>> The problem is once the original job is terminated and restarted
>> later on, the mpirun does not exist anymore (ps -efa|grep mpirun) and
>> hence I do not know which PID I should use when I run the
>> ompi-checkpoint on the restarted job. Any help would be greatly
>> appreciated.
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
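
The lookup Josh describes above (after a restart the launcher may show up
as 'orterun' rather than 'mpirun') could be sketched roughly as follows.
This is only an illustration, not part of Open MPI: the helper names
find_launcher_pids and current_launcher_pids are hypothetical, and it
assumes a 'ps' that accepts the POSIX '-e' and '-o' options. The PID it
finds would then be passed to ompi-checkpoint by hand.

```python
import subprocess


def find_launcher_pids(ps_output, names=("mpirun", "orterun")):
    """Return the PIDs of processes whose command name is in `names`.

    `ps_output` is expected to hold one "PID COMMAND" pair per line,
    as printed by `ps -e -o pid= -o comm=`.
    (Hypothetical helper, written for this illustration only.)
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, command = parts
        # Some systems print a full path; compare only the last component.
        if command.strip().rsplit("/", 1)[-1] in names:
            pids.append(int(pid))
    return pids


def current_launcher_pids():
    """Run `ps` and return PIDs of any running mpirun/orterun processes."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_launcher_pids(result.stdout)


if __name__ == "__main__":
    for pid in current_launcher_pids():
        # Candidate PID to hand to ompi-checkpoint.
        print(pid)
```

Matching on both names means the same lookup should work whether the job
was started directly with mpirun or relaunched by ompi-restart.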