On 8/18/2011 2:32 PM, George Bosilca wrote:
Terry,

The test succeeded in both of your runs.
Not really.  Granted, the test aborted in both cases; however, the case you show below has further issues while orte is trying to clean things up.  It certainly is not what I would call friendly.  But that is beside the point.  The issue, IMO, is that orte is having problems with the MPI_Errhandler_fatal_c test, and it looks like you have seen the same failure prior to the epoch changes.  Fair enough, I'll go back to the drawing board and see if I can narrow this down.

--td
However, I rolled back before the epoch change (24814) and the output is the following:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dancer.eecs.utk.edu:16098] ***    and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As you can see it is identical to the output in your test.

  george.


On Aug 18, 2011, at 12:29 , TERRY DONTJE wrote:

Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  Everything is the same except I don't see the "readv failed.." message.

Have you tried to run this code yourself?  It is pretty simple and fails with one node using np=4.

--td

On 8/18/2011 10:57 AM, Wesley Bland wrote:
I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without using the
mutex to prevent race conditions. Let me know if the MTT is still having
problems tomorrow.

Wes
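
For illustration only, here is a minimal sketch of the kind of fix Wes describes: every removal from a shared list of children happens while holding the mutex that guards the list. This is not the actual ORTE errmgr/odls code. All names (child_t, children_lock, remove_child, run_demo) are hypothetical, and plain POSIX threads are assumed here in place of whatever locking primitives Open MPI uses internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a local child process record. */
typedef struct child {
    int id;
    struct child *next;
} child_t;

/* Hypothetical shared list and the mutex that guards it. */
static child_t *children = NULL;
static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;

/* Push a new child onto the list while holding the lock. */
static void add_child(int id)
{
    child_t *c = malloc(sizeof(*c));
    if (c == NULL) abort();
    c->id = id;
    pthread_mutex_lock(&children_lock);
    c->next = children;
    children = c;
    pthread_mutex_unlock(&children_lock);
}

/* Unlink and free the child with the given id; returns 1 if it was found.
 * The whole search-and-unlink runs under the lock, so no other thread can
 * see or modify a half-updated list. */
static int remove_child(int id)
{
    int found = 0;
    pthread_mutex_lock(&children_lock);
    for (child_t **pp = &children; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            child_t *dead = *pp;
            *pp = dead->next;
            free(dead);
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&children_lock);
    return found;
}

/* Count the remaining children under the lock. */
static int count_children(void)
{
    int n = 0;
    pthread_mutex_lock(&children_lock);
    for (child_t *c = children; c != NULL; c = c->next) n++;
    pthread_mutex_unlock(&children_lock);
    return n;
}

/* Each worker thread removes its own range of ids [first, last). */
typedef struct { int first, last; } range_t;

static void *remover(void *arg)
{
    range_t *r = arg;
    for (int id = r->first; id < r->last; id++) remove_child(id);
    return NULL;
}

/* Adds 400 children, removes them from 4 concurrent threads, and returns
 * how many are left on the list afterwards. */
int run_demo(void)
{
    enum { NTHREADS = 4, PER_THREAD = 100 };
    pthread_t tid[NTHREADS];
    range_t ranges[NTHREADS];

    for (int id = 0; id < NTHREADS * PER_THREAD; id++) add_child(id);
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].first = t * PER_THREAD;
        ranges[t].last = (t + 1) * PER_THREAD;
        if (pthread_create(&tid[t], NULL, remover, &ranges[t]) != 0) abort();
    }
    for (int t = 0; t < NTHREADS; t++) pthread_join(tid[t], NULL);
    return count_children();
}
```

Without the lock in remove_child, two threads unlinking neighbouring entries at the same time could corrupt the list or free the same node twice, which would plausibly show up as an intermittent failure rather than a deterministic one.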


I am seeing the intel test suite tests MPI_Errhandler_fatal_c and
MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I had not
seen this test failing under MTT until the epoch code was added, so I
have a suspicion the epoch code might be at fault.  Could someone
familiar with the epoch changes (Wesley) take a look at this failure?

Note that this fails intermittently, but for me it fails more often than
not.  Attached is a log file of a run that succeeds followed by the failing
run.  The messages of concern are those involving mca_oob_tcp_msg_recv
and the ones below it.

thanks,

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com






_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
