Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again
From: Adrian Reber (adrian_at_[hidden])
Date: 2014-01-24 12:35:53


Status update of C/R with Open MPI:

With the last two patches applied I am now seeing communication
between orte-checkpoint and orterun:

orte-checkpoint 23975:

[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986] PID 23975
[dcbz:23986] Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--------------------------------------------------------------------------
Error: The application (PID = 23975) failed to checkpoint properly.
       Returned -1.
--------------------------------------------------------------------------

orterun:

[dcbz:23975] [[45520,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = [[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0>
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0> + <ref (null)> <seq -1>
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220

Checkpointing still fails (orterun rejects the request with
"Checkpointing currently disabled"), but at least both processes
are talking to each other, which is good.
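
The buffer-lifetime rule confirmed in the quoted thread below (a buffer
handed to a non-blocking send must not be released until the send has
completed, because the data is not copied) can be sketched in plain C
without any ORTE headers. This is only an illustration under stated
assumptions: every name in it (mock_buffer_t, mock_send_nb, mock_progress,
mock_release_cb, demo_send_lifetime) is a hypothetical stand-in and not an
Open MPI symbol, and the single-slot queue is a deliberate simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; none are Open MPI symbols. */
typedef struct {
    char  *data;
    size_t len;
} mock_buffer_t;

typedef void (*mock_send_cb_t)(int status, mock_buffer_t *buf, void *cbdata);

/* One queued send. The mock transport stores the caller's pointer, not a copy. */
static struct {
    mock_buffer_t *buf;
    mock_send_cb_t cb;
    void          *cbdata;
    int            pending;
} mock_queue;

static mock_buffer_t *mock_buffer_new(const char *payload)
{
    mock_buffer_t *buf = malloc(sizeof(*buf));
    if (buf == NULL) return NULL;
    buf->len = strlen(payload);
    buf->data = malloc(buf->len + 1);
    if (buf->data == NULL) { free(buf); return NULL; }
    memcpy(buf->data, payload, buf->len + 1);
    return buf;
}

static void mock_buffer_free(mock_buffer_t *buf)
{
    free(buf->data);
    free(buf);
}

/* Queue a send and return immediately; nothing has been transmitted yet. */
static int mock_send_nb(mock_buffer_t *buf, mock_send_cb_t cb, void *cbdata)
{
    if (mock_queue.pending) return -1;
    mock_queue.buf = buf;
    mock_queue.cb = cb;
    mock_queue.cbdata = cbdata;
    mock_queue.pending = 1;
    return 0;
}

/* Drive the queued send: read the caller's buffer, then run the callback.
 * Returns the number of bytes "sent", or 0 if nothing was pending. */
static size_t mock_progress(void)
{
    size_t sent;
    if (!mock_queue.pending) return 0;
    /* This read would touch freed memory if the caller had released early. */
    sent = strlen(mock_queue.buf->data);
    mock_queue.pending = 0;
    mock_queue.cb(0, mock_queue.buf, mock_queue.cbdata);
    return sent;
}

/* Completion callback: the only place where the buffer is released. */
static void mock_release_cb(int status, mock_buffer_t *buf, void *cbdata)
{
    int *completed = cbdata;
    (void)status;
    mock_buffer_free(buf);
    if (completed != NULL) (*completed)++;
}

/* Returns the payload length if the send completed and the callback ran once. */
size_t demo_send_lifetime(void)
{
    int completed = 0;
    size_t sent;
    mock_buffer_t *buf = mock_buffer_new("checkpoint request");
    if (buf == NULL) return 0;
    if (mock_send_nb(buf, mock_release_cb, &completed) != 0) {
        mock_buffer_free(buf);   /* never queued, so the caller still owns it */
        return 0;
    }
    /* Do NOT free buf here: the send has only been queued, not completed. */
    sent = mock_progress();
    return (completed == 1) ? sent : 0;
}
```

Calling demo_send_lifetime() returns the payload length once the callback
has run. Moving the mock_buffer_free() call to directly after
mock_send_nb() would make mock_progress() read freed memory, which is the
kind of failure the removed OBJ_RELEASE(buffer) block appears to have
caused in the patch below.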

                Adrian

On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
>
>
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> >
> > > The following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > index 7106342..8539f34 100644
> > > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > @@ -834,7 +834,7 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >      }
> > >
> > >      if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > > -                                                       ORTE_RML_TAG_CKPT, hnp_receiver,
> > > +                                                       ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> > >                                                         NULL))) {
> > >          exit_status = ret;
> > >          goto cleanup;
> > > @@ -845,11 +845,6 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >                           ORTE_JOBID_PRINT(jobid));
> > >
> > >   cleanup:
> > > -    if( NULL != buffer) {
> > > -        OBJ_RELEASE(buffer);
> > > -        buffer = NULL;
> > > -    }
> > > -
> > >      if( ORTE_SUCCESS != exit_status ) {
> > >          opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
> > >                         orte_checkpoint_globals.pid);
> > >
> > >
> > > Before committing the code into the repository I wanted to make
> > > sure it is the correct way to fix it.
> > >
> > > The first change switches the callback to orte_rml_send_callback().
> > > When I initially made the code compile again, I used hnp_receiver()
> > > while converting the code from blocking to non-blocking, and that
> > > was wrong.
> > >
> > > The second change (removal of OBJ_RELEASE(buffer)) is necessary
> > > because the release seems to delete the buffer while the send is
> > > still in progress, and then everything breaks badly.
> > >
> > > Adrian
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >
>
>
>
> --
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

                Adrian

-- 
Adrian Reber <adrian_at_[hidden]>            http://lisas.de/~adrian/
Bing's Rule:
	Don't try to stem the tide -- move the beach.