Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI_Errhandler_fatal_c failure
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2011-08-18 15:37:46


Thought I'd throw this out there: I retraced my MTT steps and did find
that there were failures of this test back until r24774. r24775 has a
commit comment that looks very relevant. I am talking to the committer
of that change now.

Sorry for the false accusation.

--td

On 8/18/2011 2:32 PM, George Bosilca wrote:
> Terry,
>
> The test succeeded in both of your runs.
>
> However, I rolled back before the epoch change (24814) and the output is the following:
>
> MPITEST info (0): Starting MPI_Errhandler_fatal test
> MPITEST info (0): This test will abort after printing the results message
> MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
> [dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
> [dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
> [dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
> [dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
> [dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [dancer.eecs.utk.edu:16098] *** and potentially your MPI job)
> MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
> [dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> As you can see it is identical to the output in your test.
>
> george.
>
>
> On Aug 18, 2011, at 12:29 , TERRY DONTJE wrote:
>
>> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything is the same except I don't see the "readv failed.." message.
>>
>> Have you tried to run this code yourself? It is pretty simple and fails with one node using np=4.
>>
>> --td
>>
>> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>>> I just checked in a fix (I hope). I think the problem was that the errmgr
>>> was removing children from the list of odls children without using the
>>> mutex to prevent race conditions. Let me know if the MTT is still having
>>> problems tomorrow.
>>>
>>> Wes
>>>
>>>
>>>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
>>>> seen this test fail under MTT until the epoch code was added, so I
>>>> suspect the epoch code might be at fault. Could someone
>>>> familiar with the epoch changes (Wesley) take a look at this failure?
>>>>
>>>> Note that this fails intermittently, but for me it fails more often
>>>> than not. Attached is a log file of a run that succeeds followed by
>>>> the failing run. The messages of concern are the
>>>> mca_oob_tcp_msg_recv line and everything below it.
>>>>
>>>> thanks,
>>>>
>>>> --
>>>> Oracle
>>>> Terry D. Dontje | Principal Software Engineer
>>>> Developer Tools Engineering | +1.781.442.2631
>>>> Oracle *- Performance Technologies*
>>>> 95 Network Drive, Burlington, MA 01803
>>>> Email
>>>> terry.dontje_at_[hidden]<mailto:terry.dontje_at_[hidden]>
>>>>
>> --
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden]
>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>
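
Note on the race discussed in this thread: Wesley's message attributes the problem to the errmgr removing children from the list of odls children without taking the mutex. The general pattern can be sketched as below. This is only an illustration under stated assumptions, not the actual Open MPI code or the fix that was committed: the types `child_t` and `child_list_t` and the functions `child_list_init`, `child_list_add`, `child_list_remove`, and `child_list_count` are hypothetical names invented for this sketch, and it uses plain POSIX threads rather than Open MPI's own list and mutex types.

```c
/* Hypothetical sketch, NOT Open MPI source: a child list whose every
 * mutation and traversal happens under one mutex. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct child {
    int rank;               /* illustrative identifier for a child process */
    struct child *next;
} child_t;

typedef struct {
    child_t *head;
    pthread_mutex_t lock;   /* guards head and every next pointer */
} child_list_t;

static void child_list_init(child_list_t *list)
{
    assert(list != NULL);
    list->head = NULL;
    pthread_mutex_init(&list->lock, NULL);
}

/* Push a new child at the head of the list; returns 0 on success. */
static int child_list_add(child_list_t *list, int rank)
{
    child_t *c = malloc(sizeof(*c));
    if (c == NULL) {
        return -1;
    }
    c->rank = rank;
    pthread_mutex_lock(&list->lock);
    c->next = list->head;
    list->head = c;
    pthread_mutex_unlock(&list->lock);
    return 0;
}

/* Remove the child with the given rank; returns 1 if one was removed.
 * Without the lock, a concurrent walker could follow a pointer to a
 * node that has just been unlinked and freed. */
static int child_list_remove(child_list_t *list, int rank)
{
    int removed = 0;
    pthread_mutex_lock(&list->lock);
    for (child_t **pp = &list->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->rank == rank) {
            child_t *dead = *pp;
            *pp = dead->next;   /* unlink while holding the lock */
            free(dead);
            removed = 1;
            break;
        }
    }
    pthread_mutex_unlock(&list->lock);
    return removed;
}

/* Count the children; readers take the same lock as writers. */
static int child_list_count(child_list_t *list)
{
    int n = 0;
    pthread_mutex_lock(&list->lock);
    for (child_t *c = list->head; c != NULL; c = c->next) {
        n++;
    }
    pthread_mutex_unlock(&list->lock);
    return n;
}
```

The point of the sketch is that removal and traversal share one lock, so no thread can observe a half-unlinked node. Whether this matches the change that was checked in cannot be determined from this thread.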


