Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI_Errhandler_fatal_c failure
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-08-18 14:32:21


Terry,

The test succeeded in both of your runs.

However, I rolled back to before the epoch change (r24814), and the output is the following:

MPITEST info (0): Starting MPI_Errhandler_fatal test
MPITEST info (0): This test will abort after printing the results message
MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dancer.eecs.utk.edu:16098] *** and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As you can see, it is identical to the output of your test.
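
For reference, the failure path the test exercises is roughly the
following (a minimal sketch of the pattern, not the actual Intel test
source):

  /* Illustrative sketch only: with MPI_ERRORS_ARE_FATAL installed on a
   * duplicated communicator, a deliberately invalid MPI_Send must abort
   * the whole job. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Comm dup_comm;
      int rank, size, payload = 42;

      MPI_Init(&argc, &argv);
      MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
      MPI_Comm_rank(dup_comm, &rank);
      MPI_Comm_size(dup_comm, &size);

      /* MPI_ERRORS_ARE_FATAL is the default; setting it explicitly just
       * makes the intent obvious. */
      MPI_Comm_set_errhandler(dup_comm, MPI_ERRORS_ARE_FATAL);

      printf("rank %d: sending to an invalid rank\n", rank);

      /* size + 1 is not a valid rank in dup_comm, so this triggers
       * MPI_ERR_RANK and the fatal error handler aborts the processes. */
      MPI_Send(&payload, 1, MPI_INT, size + 1, 0, dup_comm);

      /* Never reached. */
      MPI_Finalize();
      return 0;
  }

Running something like "mpirun -np 4 ./MPI_Errhandler_fatal_c" on a single
node (optionally with "--mca orte_base_help_aggregate 0" to see every copy
of the help message) should end in the same MPI_ERR_RANK abort shown above.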

  george.

On Aug 18, 2011, at 12:29, TERRY DONTJE wrote:

> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything is the same except I don't see the "readv failed.." message.
>
> Have you tried to run this code yourself? It is pretty simple and fails with one node using np=4.
>
> --td
>
> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>> I just checked in a fix (I hope). I think the problem was that the errmgr
>> was removing children from the list of odls children without using the
>> mutex to prevent race conditions. Let me know if the MTT is still having
>> problems tomorrow.
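>>
>> Roughly, the change amounts to the pattern below (a generic sketch for
>> illustration only; the names child_t, children, children_lock and
>> remove_child are made up here and are not the actual errmgr/odls
>> symbols):
>>
>>   #include <pthread.h>
>>   #include <stdlib.h>
>>
>>   typedef struct child { struct child *next; int pid; } child_t;
>>
>>   static child_t *children = NULL;          /* shared list of children */
>>   static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;
>>
>>   /* Remove one child from the shared list.  Both the traversal and the
>>    * unlink happen while holding the lock; unlinking without the lock is
>>    * the kind of race the fix closes. */
>>   static void remove_child(int pid)
>>   {
>>       pthread_mutex_lock(&children_lock);
>>       for (child_t **cur = &children; *cur != NULL; cur = &(*cur)->next) {
>>           if ((*cur)->pid == pid) {
>>               child_t *dead = *cur;
>>               *cur = dead->next;
>>               free(dead);
>>               break;
>>           }
>>       }
>>       pthread_mutex_unlock(&children_lock);
>>   }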
>>
>> Wes
>>
>>
>>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
>>> seen these tests failing under MTT until the epoch code was added, so I
>>> suspect the epoch code might be at fault. Could someone familiar with
>>> the epoch changes (Wesley) take a look at this failure?
>>>
>>> Note this fails intermittently, but it fails for me more often than not.
>>> Attached is a log file of a run that succeeds, followed by the failing
>>> run. The part of concern is the messages involving mca_oob_tcp_msg_recv
>>> and below.
>>>
>>> thanks,
>>>
>>> --
>>> Oracle
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle - Performance Technologies
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.dontje_at_[hidden]
>>>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel