Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of "not giving up the thread". The problem was that Josh's test never called progress. It would have been equally okay to simply call "opal_event_dispatch" while waiting for the callback.

All applications have to cycle the progress engine.
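
To make the point concrete, here is a minimal, self-contained sketch of why a wait loop has to drive progress. It is a toy model, not ORTE code: every `toy_*` name below is hypothetical and merely stands in for the event library and the errmgr callback. In the real test, the equivalent of `toy_progress` would be a call such as `opal_event_dispatch`, or a wait macro like `ORTE_PROGRESSED_WAIT`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an event library: a single pending-callback slot. */
typedef void (*toy_callback_fn)(void *arg);

typedef struct {
    toy_callback_fn cb;   /* callback waiting to be delivered, or NULL */
    void *arg;            /* argument handed to the callback */
} toy_event_queue;

/* Post an event. Nothing is delivered until toy_progress() runs. */
static void toy_post(toy_event_queue *q, toy_callback_fn cb, void *arg)
{
    q->cb = cb;
    q->arg = arg;
}

/* One turn of the progress engine: deliver the pending callback, if any.
 * Returns the number of callbacks delivered (0 or 1). */
static int toy_progress(toy_event_queue *q)
{
    if (q->cb == NULL) {
        return 0;
    }
    toy_callback_fn cb = q->cb;
    q->cb = NULL;
    cb(q->arg);
    return 1;
}

/* Example callback: record that the notification arrived. */
static void toy_set_flag(void *arg)
{
    *(bool *)arg = true;
}

/* Broken pattern: spin on the flag without ever cycling progress.
 * The loop is bounded here only so that the sketch terminates. */
static bool toy_wait_without_progress(toy_event_queue *q, bool *flag, int max_spins)
{
    (void)q;  /* the queue is never touched, which is exactly the bug */
    for (int i = 0; i < max_spins && !*flag; i++) {
        /* nothing in this loop can deliver the callback */
    }
    return *flag;
}

/* Working pattern: cycle the progress engine on every iteration. */
static bool toy_wait_with_progress(toy_event_queue *q, bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        toy_progress(q);
    }
    return *flag;
}
```

With an event posted via `toy_post`, the first wait gives up with the flag still unset, while the second returns as soon as the callback has been delivered. No threads are involved in either case; the only difference is whether the waiting code calls progress.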


On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:

Josh,

There were a couple of bugs that I cleared up in my most recent checkin, but I also needed to modify your test. The callback for the application layer errmgr actually occurs in the application layer. Your test was never giving up the thread to the ORTE application event loop to receive its message from the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that fixed the problem.

Try running the attached code with the modifications and see if that clears up the problem. It did for me.

Thanks,
Wesley

On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:

So I finally got a chance to test the branch this morning. I cannot
get it to work. Maybe I'm doing something wrong, or missing some MCA
parameter?

-------------------------
[jjhursey@smoky-login1 resilient-orte] hg summary
parent: 2:c550cf6ed6a2 tip
Newest version. Synced with trunk r24785.
branch: default
commit: 1 modified, 8097 unknown
update: (current)
-------------------------
(the 1 modified was the test program attached)

Attached is a modified version of the orte_abort.c program found in
${top}/orte/test/system. This program is ORTE only, and registers the
errmgr callback to trigger correct termination. You will need to
configure Open MPI with '--with-devel-headers' to build this. But then
you can compile with:
ortecc -g orte_abort.c -o orte_abort

These are the configure options that I used:
--with-devel-headers --enable-binaries --disable-io-romio
--enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
F77=gfortran FC=gfortran


If the HNP has no processes on it - I get a hang:
-------------------------------
mpirun -np 4 --nolocal orte_abort
orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
mpirun: killing job...

[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file errmgr_hnp.c at line 824
[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
-------------------------------

If the HNP has processes on it, but not the one that aborted - I get a hang:
-------------------------------
[jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
mpirun: killing job...

[smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file errmgr_hnp.c at line 824
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
--------------------------------

If the HNP has processes on it, and it is the one that aborted - I get
immediate return, but no callback:
--------------------------------
[jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
[jjhursey@smoky14 system] echo $?
3
--------------------------------

Any ideas on what I might be doing wrong?

I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.

-- Josh



On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbland@eecs.utk.edu> wrote:
Last reminder (I hope). RFC goes in at COB today.
Wesley
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

Attachments:
- orte_abort.c