Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.
Everything is the same except I don't see the "readv failed.." message.
Have you tried to run this code yourself? It is pretty simple and
fails with one node using np=4.
On 8/18/2011 10:57 AM, Wesley Bland wrote:
> I just checked in a fix (I hope). I think the problem was that the errmgr
> was removing children from the list of odls children without using the
> mutex to prevent race conditions. Let me know if the MTT is still having
> problems tomorrow.
>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
>> seen these tests failing under MTT until the epoch code was added. So I
>> have a suspicion the epoch code might be at fault. Could someone
>> familiar with the epoch changes (Wesley) take a look at this failure?
>> Note this intermittently fails but fails for me more times than not.
>> Attached is a log file of a run that succeeds followed by the failing
>> run. The messages of concern are those involving
>> mca_oob_tcp_msg_recv and the ones below them.
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle *- Performance Technologies*
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden]<mailto:terry.dontje_at_[hidden]>
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>