The test succeeded in both of your runs.
However, I rolled back to before the epoch change (24814) and the output is the following:
MPITEST info (0): Starting MPI_Errhandler_fatal test
MPITEST info (0): This test will abort after printing the results message
MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dancer.eecs.utk.edu:16098] *** and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
As you can see, it is identical to the output from your test.
On Aug 18, 2011, at 12:29 , TERRY DONTJE wrote:
> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything is the same except I don't see the "readv failed.." message.
> Have you tried to run this code yourself? It is pretty simple and fails on one node using np=4.
> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>> I just checked in a fix (I hope). I think the problem was that the errmgr
>> was removing children from the list of odls children without using the
>> mutex that prevents race conditions. Let me know if MTT is still having
>> problems tomorrow.
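[Editor's note: the race described above can be sketched in plain C. This is a hypothetical illustration with made-up names, not the actual Open MPI odls/errmgr code: a child list shared between threads must be mutated only while holding its mutex, on removal as well as insertion.]

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the odls child list: a singly linked
 * list shared between threads, guarded by one mutex. */
typedef struct child {
    struct child *next;
    int pid;
} child_t;

static child_t *children = NULL;
static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insertion takes the lock around the list mutation. */
static void add_child(int pid) {
    child_t *c = malloc(sizeof(*c));
    c->pid = pid;
    pthread_mutex_lock(&children_lock);
    c->next = children;
    children = c;
    pthread_mutex_unlock(&children_lock);
}

/* The bug class described above: removing without the lock lets a
 * concurrent reader walk a freed node. The fix is simply to hold
 * the same mutex here too. */
static void remove_child(int pid) {
    pthread_mutex_lock(&children_lock);   /* the lock the errmgr was skipping */
    for (child_t **p = &children; *p; p = &(*p)->next) {
        if ((*p)->pid == pid) {
            child_t *dead = *p;
            *p = dead->next;
            free(dead);
            break;
        }
    }
    pthread_mutex_unlock(&children_lock);
}
```

Without that lock in the removal path, an error-handling thread tearing down a child while another thread traverses the list can intermittently crash or corrupt the list, which would match the "fails more times than not" behavior below.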
>>> I am seeing the intel test suite tests MPI_Errhandler_fatal_c and
>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
>>> seen these tests fail under MTT until the epoch code was added, so I
>>> suspect the epoch code might be at fault. Could someone familiar with
>>> the epoch changes (Wesley) take a look at this failure?
>>> Note this fails intermittently, but more times than not for me.
>>> Attached is a log file of a run that succeeds, followed by the failing
>>> run. The messages of concern are the ones involving
>>> mca_oob_tcp_msg_recv and below.
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle *- Performance Technologies*
>>> 95 Network Drive, Burlington, MA 01803
>>> terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>