Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again
From: Adrian Reber (adrian_at_[hidden])
Date: 2014-01-24 12:35:53


Status update of C/R with Open MPI:

With the last two patches applied I am now seeing communication
between orte-checkpoint and orterun:

orte-checkpoint 23975:

[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986] PID 23975
[dcbz:23986] Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--------------------------------------------------------------------------
Error: The application (PID = 23975) failed to checkpoint properly.
       Returned -1.
--------------------------------------------------------------------------

orterun:

[dcbz:23975] [[45520,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = [[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0>
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0> + <ref (null)> <seq -1>
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220

It's still not working, but at least both processes are
talking to each other, which is good.
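
For reference, the buffer-ownership rule behind the patch quoted below (the buffer must not be released until the non-blocking send completes, because the data is not copied) can be sketched as follows. This is a minimal, self-contained mock and not the ORTE implementation: mock_buffer_t, mock_send_nb(), mock_progress(), release_on_complete() and demo_send() are hypothetical names invented for illustration. The assumption is that the completion callback is the place where the buffer gets released, which is presumably the role the patch hands to orte_rml_send_callback().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a message buffer; NOT the real opal_buffer_t. */
typedef struct {
    char  *data;
    size_t len;
} mock_buffer_t;

/* Completion callback type: invoked once the send has finished. */
typedef void (*mock_send_cb_t)(int status, mock_buffer_t *buf, void *cbdata);

/* One pending send. It stores the caller's pointer, not a copy of the data. */
typedef struct {
    mock_buffer_t *buf;
    mock_send_cb_t cb;
    void          *cbdata;
    int            active;
} mock_pending_t;

mock_pending_t pending;        /* single-slot "queue", enough for a sketch */
int released_count = 0;        /* counts buffer releases for inspection */

mock_buffer_t *mock_buffer_new(const char *payload)
{
    mock_buffer_t *b = malloc(sizeof(*b));
    if (NULL == b) return NULL;
    b->len  = strlen(payload) + 1;
    b->data = malloc(b->len);
    if (NULL == b->data) { free(b); return NULL; }
    memcpy(b->data, payload, b->len);
    return b;
}

void mock_buffer_release(mock_buffer_t *b)
{
    assert(NULL != b);
    free(b->data);
    free(b);
    released_count++;
}

/* Non-blocking send: only records the request and returns immediately. */
int mock_send_nb(mock_buffer_t *buf, mock_send_cb_t cb, void *cbdata)
{
    if (pending.active) return -1;
    pending.buf    = buf;
    pending.cb     = cb;
    pending.cbdata = cbdata;
    pending.active = 1;
    return 0;
}

/* Progress step: the buffer is read here, after mock_send_nb() has returned.
 * Returns the number of bytes "sent", then hands the buffer to the callback. */
size_t mock_progress(void)
{
    size_t sent;
    if (!pending.active) return 0;
    sent = pending.buf->len;
    pending.active = 0;
    pending.cb(0, pending.buf, pending.cbdata);
    return sent;
}

/* Callback that releases the buffer once the send is complete. */
void release_on_complete(int status, mock_buffer_t *buf, void *cbdata)
{
    (void)status;
    (void)cbdata;
    mock_buffer_release(buf);
}

/* Queue a send and drive it to completion; returns the bytes sent. */
size_t demo_send(const char *payload)
{
    mock_buffer_t *buf = mock_buffer_new(payload);
    if (NULL == buf) return 0;
    if (0 != mock_send_nb(buf, release_on_complete, NULL)) {
        mock_buffer_release(buf);   /* never queued: caller still owns it */
        return 0;
    }
    /* Do NOT release buf here: the send is still pending and the queue
     * only holds our pointer. The callback releases it on completion. */
    return mock_progress();
}
```

Calling demo_send("checkpoint") from a main() exercises the whole path. Releasing the buffer in the caller right after mock_send_nb() returns would leave mock_progress() reading freed memory, which is the kind of breakage the removed OBJ_RELEASE(buffer) caused.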

                Adrian

On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
>
>
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> >
> > > Following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > index 7106342..8539f34 100644
> > > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > @@ -834,7 +834,7 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >      }
> > >
> > >      if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > > -                                                       ORTE_RML_TAG_CKPT, hnp_receiver,
> > > +                                                       ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> > >                                                         NULL))) {
> > >          exit_status = ret;
> > >          goto cleanup;
> > > @@ -845,11 +845,6 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >                          ORTE_JOBID_PRINT(jobid));
> > >
> > >   cleanup:
> > > -    if( NULL != buffer) {
> > > -        OBJ_RELEASE(buffer);
> > > -        buffer = NULL;
> > > -    }
> > > -
> > >      if( ORTE_SUCCESS != exit_status ) {
> > >          opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
> > >                         orte_checkpoint_globals.pid);
> > >
> > >
> > > Before committing the code into the repository I wanted to make
> > > sure it is the correct way to fix it.
> > >
> > > > The first change replaces the callback with orte_rml_send_callback().
> > > > When I initially made the code compile again, I passed hnp_receiver()
> > > > as the callback while converting the send from blocking to
> > > > non-blocking, and that was wrong.
> > > >
> > > > The second change (removal of OBJ_RELEASE(buffer)) is necessary
> > > > because releasing the buffer there seems to delete it while the
> > > > send is still in progress, and then everything breaks badly.
> > >
> > > Adrian
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
>
>
>
> --
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

                Adrian

-- 
Adrian Reber <adrian_at_[hidden]>            http://lisas.de/~adrian/
Bing's Rule:
	Don't try to stem the tide -- move the beach.