Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-06-23 15:56:16


Ga - what a rookie mistake :)

I ran the patched test and it works as advertised for the small-scale
tests I used before. So I'm good with this going in today.

Thanks,
Josh

On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland <wbland_at_[hidden]> wrote:
> Right. Sorry I misspoke.
>
> On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:
>
> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem
> of "not giving up the thread". The problem was that Josh's test never called
> progress. It would have been equally okay to simply call
> "opal_event_dispatch" while waiting for the callback.
> All applications have to cycle the progress engine.
>
> On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:
>
> Josh,
> There were a couple of bugs that I cleared up in my most recent checkin, but
> I also needed to modify your test. The callback for the application layer
> errmgr actually occurs in the application layer. Your test was never giving
> up the thread to the ORTE application event loop to receive its message from
> the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that
> fixed the problem.
> Try running the attached code with the modifications and see if that clears
> up the problem. It did for me.
> Thanks,
> Wesley
>
> On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
>
> So I finally got a chance to test the branch this morning. I cannot
> get it to work. Maybe I'm doing something wrong, or missing some MCA
> parameter?
>
> -------------------------
> [jjhursey_at_smoky-login1 resilient-orte] hg summary
> parent: 2:c550cf6ed6a2 tip
> Newest version. Synced with trunk r24785.
> branch: default
> commit: 1 modified, 8097 unknown
> update: (current)
> -------------------------
> (the 1 modified was the test program attached)
>
> Attached is a modified version of the orte_abort.c program found in
> ${top}/orte/test/system. This program is ORTE only, and registers the
> errmgr callback to trigger correct termination. You will need to
> configure Open MPI with '--with-devel-headers' to build this. But then
> you can compile with:
> ortecc -g orte_abort.c -o orte_abort
>
> These are the configure options that I used:
> --with-devel-headers --enable-binaries --disable-io-romio
> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> F77=gfortran FC=gfortran
>
>
> If the HNP has no processes on it - I get a hang:
> -------------------------------
> mpirun -np 4 --nolocal orte_abort
> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> mpirun: killing job...
>
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> [jjhursey_at_smoky14 system] echo $?
> 1
> -------------------------------
>
> If the HNP has processes on it, but not the one that aborted - I get a hang:
> -------------------------------
> [jjhursey_at_smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> mpirun: killing job...
>
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> [jjhursey_at_smoky14 system] echo $?
> 1
> --------------------------------
>
> If the HNP has processes on it, and it is the one that aborted - I get
> an immediate return, but no callback:
> --------------------------------
> [jjhursey_at_smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> [jjhursey_at_smoky14 system] echo $?
> 3
> --------------------------------
>
> Any ideas on what I might be doing wrong?
>
> I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
> NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
>
> -- Josh
>
>
>
> On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbland_at_[hidden]> wrote:
>
> Last reminder (I hope). RFC goes in at COB today.
> Wesley
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> Attachments:
> - orte_abort.c
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
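
The point made in the thread, that a waiting application has to cycle the
progress engine or its errmgr callback never fires, can be illustrated with
a toy event queue. This is only a sketch under stated assumptions: toy_post,
toy_progress, on_abort_notice and the two wait helpers are hypothetical
names invented for illustration, not ORTE or OPAL APIs, and the queue stands
in for whatever the real event library does internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an event library: a small fixed-size queue. */
#define TOY_MAX 8

typedef void (*toy_cb_t)(void *arg);

typedef struct {
    toy_cb_t cb;   /* callback to run when the event is delivered */
    void *arg;     /* opaque argument handed to the callback */
} toy_event_t;

static toy_event_t toy_queue[TOY_MAX];
static size_t toy_count = 0;

/* Queue an event, as if a message had just arrived from a daemon.
 * Nothing is delivered here; delivery happens only in toy_progress(). */
static bool toy_post(toy_cb_t cb, void *arg)
{
    if (toy_count == TOY_MAX) {
        return false;
    }
    toy_queue[toy_count++] = (toy_event_t){cb, arg};
    return true;
}

/* One pass of the toy progress engine: run every queued callback.
 * Returns the number of callbacks that were run. */
static int toy_progress(void)
{
    int delivered = 0;
    assert(toy_count <= TOY_MAX);
    for (size_t i = 0; i < toy_count; i++) {
        toy_queue[i].cb(toy_queue[i].arg);
        delivered++;
    }
    toy_count = 0;
    return delivered;
}

/* Hypothetical application callback: records that the notice arrived. */
static void on_abort_notice(void *arg)
{
    *(bool *)arg = true;
}

/* Spin on the flag without ever calling progress. Bounded so the sketch
 * terminates; an unbounded version of this loop would hang. */
static bool wait_without_progress(const volatile bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        /* nothing here drives the event queue */
    }
    return *flag;
}

/* Same wait, but each iteration cycles the progress engine. */
static bool wait_with_progress(const volatile bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        toy_progress();
    }
    return *flag;
}
```

In the thread's terms, the first helper is roughly the original test's bare
while loop, and the second is roughly what waiting through
ORTE_PROGRESSED_WAIT or calling opal_event_dispatch achieves: the callback
runs in the application, so the application has to give the event loop a
chance to run it.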