Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Wesley Bland (wbland_at_[hidden])
Date: 2011-06-23 15:34:21


Right. Sorry I misspoke.

On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:

> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of "not giving up the thread". The problem was that Josh's test never called progress. It would have been equally okay to simply call "opal_event_dispatch" while waiting for the callback.
>
> All applications have to cycle the progress engine.
>
>
> On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:
> > Josh,
> >
> > There were a couple of bugs that I cleared up in my most recent checkin, but I also needed to modify your test. The callback for the application layer errmgr actually occurs in the application layer. Your test was never giving up the thread to the ORTE application event loop to receive its message from the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that fixed the problem.
> >
> > Try running the attached code with the modifications and see if that clears up the problem. It did for me.
> >
> > Thanks,
> > Wesley
> >
> > On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> >
> > > So I finally got a chance to test the branch this morning. I cannot
> > > > get it to work. Maybe I'm doing something wrong, or missing some MCA
> > > parameter?
> > >
> > > -------------------------
> > > [jjhursey_at_smoky-login1 resilient-orte] hg summary
> > > parent: 2:c550cf6ed6a2 tip
> > > Newest version. Synced with trunk r24785.
> > > branch: default
> > > commit: 1 modified, 8097 unknown
> > > update: (current)
> > > -------------------------
> > > (the 1 modified was the test program attached)
> > >
> > > Attached is a modified version of the orte_abort.c program found in
> > > ${top}/orte/test/system. This program is ORTE only, and registers the
> > > errmgr callback to trigger correct termination. You will need to
> > > configure Open MPI with '--with-devel-headers' to build this. But then
> > > you can compile with:
> > > ortecc -g orte_abort.c -o orte_abort
> > >
> > > These are the configure options that I used:
> > > --with-devel-headers --enable-binaries --disable-io-romio
> > > --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> > > F77=gfortran FC=gfortran
> > >
> > >
> > > If the HNP has no processes on it - I get a hang:
> > > -------------------------------
> > > mpirun -np 4 --nolocal orte_abort
> > > orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> > > orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> > > orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> > > mpirun: killing job...
> > >
> > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file errmgr_hnp.c at line 824
> > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file orted/orted_comm.c at line 1341
> > > mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> > >
> > > [jjhursey_at_smoky14 system] echo $?
> > > 1
> > > -------------------------------
> > >
> > > If the HNP has processes on it, but not the one that aborted - I get a hang:
> > > -------------------------------
> > > [jjhursey_at_smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> > > orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> > > orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> > > orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> > > mpirun: killing job...
> > >
> > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> > > readv failed: Connection reset by peer (104)
> > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> > > readv failed: Connection reset by peer (104)
> > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file errmgr_hnp.c at line 824
> > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file orted/orted_comm.c at line 1341
> > > mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> > >
> > > [jjhursey_at_smoky14 system] echo $?
> > > 1
> > > --------------------------------
> > >
> > > If the HNP has processes on it, and it is the one that aborted - I get
> > > immediate return, but no callback:
> > > --------------------------------
> > > [jjhursey_at_smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> > > orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> > > orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> > > orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> > > orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> > > orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> > > [jjhursey_at_smoky14 system] echo $?
> > > 3
> > > --------------------------------
> > >
> > > Any ideas on what I might be doing wrong?
> > >
> > > I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
> > > NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
> > >
> > > -- Josh
> > >
> > >
> > >
> > > > On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbland_at_[hidden]> wrote:
> > > > > Last reminder (I hope). RFC goes in at COB today.
> > > > Wesley
> > > > _______________________________________________
> > > > devel mailing list
> > > > devel_at_[hidden] (mailto:devel_at_[hidden])
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> > >
> > >
> > > --
> > > Joshua Hursey
> > > Postdoctoral Research Associate
> > > Oak Ridge National Laboratory
> > > http://users.nccs.gov/~jjhursey
> > >
> > > Attachments:
> > > - orte_abort.c
> > >
> > >
> >
> >
> > <orte_abort.c>
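For readers following the thread, here is a minimal sketch of the point Ralph and Wesley settle on: a callback delivered through an event loop only runs if the waiting code cycles that loop. This is not ORTE code. `event_loop_t`, `loop_post`, `loop_progress`, `on_failure` and the two wait functions are hypothetical stand-ins invented for illustration. In the real test the fix was to replace the plain while loop with ORTE_PROGRESSED_WAIT, and Ralph notes that calling opal_event_dispatch would have worked equally well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an event library: a single pending callback slot. */
typedef void (*event_cb_t)(void *arg);

typedef struct {
    event_cb_t cb;  /* callback waiting to be delivered, or NULL */
    void *arg;      /* argument handed to the callback */
} event_loop_t;

/* Queue a callback, as if a message from the daemon had just arrived. */
static void loop_post(event_loop_t *loop, event_cb_t cb, void *arg)
{
    loop->cb = cb;
    loop->arg = arg;
}

/* One pass of the loop: deliver the pending callback, if there is one.
 * Returns true if a callback ran. */
static bool loop_progress(event_loop_t *loop)
{
    if (loop->cb == NULL) {
        return false;
    }
    event_cb_t cb = loop->cb;
    loop->cb = NULL;
    cb(loop->arg);
    return true;
}

/* Application-layer callback: records that the failure notice was seen. */
static void on_failure(void *arg)
{
    *(bool *)arg = true;
}

/* The pattern that hangs: poll the flag but never cycle the loop.
 * Bounded by max_spins here so the sketch terminates. */
static bool wait_without_progress(event_loop_t *loop, const bool *flag,
                                  int max_spins)
{
    (void)loop; /* nothing in this function drives the loop */
    for (int i = 0; i < max_spins && !*flag; i++) {
        /* spin */
    }
    return *flag;
}

/* The pattern that works: cycle the loop on every iteration of the wait. */
static bool wait_with_progress(event_loop_t *loop, const bool *flag,
                               int max_spins)
{
    assert(loop != NULL && flag != NULL);
    for (int i = 0; i < max_spins && !*flag; i++) {
        loop_progress(loop);
    }
    return *flag;
}
```

With a callback posted via `loop_post`, `wait_without_progress` returns false however long it spins, while `wait_with_progress` returns true after its first pass. No threads are involved in either case, which matches Ralph's clarification: the wait simply has to drive the loop itself.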