
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] MPI_Errhandler_fatal_c failure
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2011-08-18 14:58:29


On 8/18/2011 2:32 PM, George Bosilca wrote:
> Terry,
>
> The test succeeded in both of your runs.
Not really. Granted, the test aborted in both cases; however, the case
you show below has further problems while orte is trying to clean
things up. It certainly is not what I would call friendly. But that is
beside the point. IMO the issue is that orte is having trouble with the
MPI_Errhandler_fatal_c test, and it looks like you have seen the same
failure prior to the epoch changes. Fair enough, I'll go back to the
drawing board and see if I can narrow this down.

--td
> However, I rolled back before the epoch change (24814) and the output is the following:
>
> MPITEST info (0): Starting MPI_Errhandler_fatal test
> MPITEST info (0): This test will abort after printing the results message
> MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
> [dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
> [dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
> [dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
> [dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
> [dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [dancer.eecs.utk.edu:16098] *** and potentially your MPI job)
> MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
> [dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> As you can see it is identical to the output in your test.
>
> george.
>
>
> On Aug 18, 2011, at 12:29 , TERRY DONTJE wrote:
>
>> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything is the same except I don't see the "readv failed.." message.
>>
>> Have you tried to run this code yourself? It is pretty simple and fails on one node using np=4.
>>
>> --td
>>
>> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>>> I just checked in a fix (I hope). I think the problem was that the errmgr
>>> was removing children from the list of odls children without using the
>>> mutex to prevent race conditions. Let me know if the MTT is still having
>>> problems tomorrow.
>>>
>>> Wes
>>>
>>>
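
[Editorial sketch] The fix Wesley describes above amounts to taking a mutex before removing an entry from a shared list of children. The following is a minimal, self-contained illustration of that pattern using POSIX threads. It is not the ORTE errmgr or odls code: the names child_t, child_list_t, and the child_list_* functions are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for one entry in a list of local children. */
typedef struct child {
    int rank;
    struct child *next;
} child_t;

/* Hypothetical list type: the head pointer plus the mutex that guards it. */
typedef struct {
    child_t *head;
    pthread_mutex_t lock;
} child_list_t;

void child_list_init(child_list_t *list)
{
    assert(list != NULL);
    list->head = NULL;
    pthread_mutex_init(&list->lock, NULL);
}

/* Prepend a child. Returns 0 on success, -1 if allocation fails. */
int child_list_add(child_list_t *list, int rank)
{
    assert(list != NULL);
    child_t *c = malloc(sizeof(*c));
    if (c == NULL) {
        return -1;
    }
    c->rank = rank;
    pthread_mutex_lock(&list->lock);
    c->next = list->head;
    list->head = c;
    pthread_mutex_unlock(&list->lock);
    return 0;
}

/* Remove the child with the given rank while holding the lock, so that a
 * concurrent traversal or removal never sees a half-updated list.
 * Returns 1 if a child was removed, 0 if no such child was found. */
int child_list_remove(child_list_t *list, int rank)
{
    assert(list != NULL);
    int removed = 0;
    pthread_mutex_lock(&list->lock);
    for (child_t **pp = &list->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->rank == rank) {
            child_t *victim = *pp;
            *pp = victim->next;   /* unlink before freeing */
            free(victim);
            removed = 1;
            break;
        }
    }
    pthread_mutex_unlock(&list->lock);
    return removed;
}

/* Count the entries, also under the lock. */
int child_list_count(child_list_t *list)
{
    assert(list != NULL);
    int n = 0;
    pthread_mutex_lock(&list->lock);
    for (child_t *c = list->head; c != NULL; c = c->next) {
        n++;
    }
    pthread_mutex_unlock(&list->lock);
    return n;
}
```

Every reader and writer of the list has to take the same lock. Guarding only the removal path would still leave a race with any unguarded traversal.
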
>>>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
>>>> seen these tests fail under MTT until the epoch code was added, so I
>>>> suspect the epoch code might be at fault. Could someone
>>>> familiar with the epoch changes (Wesley) take a look at this failure?
>>>>
>>>> Note that this fails intermittently, but for me it fails more often than not.
>>>> Attached is a log file of a run that succeeds followed by the failing
>>>> run. The lines of concern are the messages involving
>>>> mca_oob_tcp_msg_recv and below.
>>>>
>>>> thanks,
>>>>
>>>> --
>>>> Oracle
>>>> Terry D. Dontje | Principal Software Engineer
>>>> Developer Tools Engineering | +1.781.442.2631
>>>> Oracle - Performance Technologies
>>>> 95 Network Drive, Burlington, MA 01803
>>>> Email terry.dontje_at_[hidden]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>> --
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden]
>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


