I just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything is the same, except that I don't see the "readv failed.." message.

Have you tried to run this code yourself? It is pretty simple and fails on one node using np=4.

--td

On 8/18/2011 10:57 AM, Wesley Bland wrote:
I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without using the
mutex to prevent race conditions. Let me know if the MTT is still having
problems tomorrow.

Wes
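
For readers unfamiliar with the kind of race described above, here is a
minimal sketch of removing an entry from a shared list while holding a
mutex. It is not the Open MPI errmgr or odls code: the names child_t,
child_list_t, child_list_add, child_list_remove, and child_list_count are
hypothetical, and a plain pthread mutex is assumed in place of whatever
locking primitive the real code uses.

```c
/* Illustrative sketch only: NOT the Open MPI errmgr/odls code.
 * All type and function names below are hypothetical. */
#include <pthread.h>
#include <stdlib.h>

typedef struct child {
    int rank;               /* identifier of the child process */
    struct child *next;
} child_t;

typedef struct {
    child_t *head;
    pthread_mutex_t lock;   /* guards every access to head and the nodes */
} child_list_t;

void child_list_init(child_list_t *list)
{
    list->head = NULL;
    pthread_mutex_init(&list->lock, NULL);
}

/* Push a new child onto the list. Returns 0 on success, -1 if malloc fails. */
int child_list_add(child_list_t *list, int rank)
{
    child_t *node = malloc(sizeof(*node));
    if (node == NULL) {
        return -1;
    }
    node->rank = rank;

    pthread_mutex_lock(&list->lock);
    node->next = list->head;
    list->head = node;
    pthread_mutex_unlock(&list->lock);
    return 0;
}

/* Unlink and free the child with the given rank while holding the lock,
 * so no other thread can walk the list mid-removal.
 * Returns 1 if a child was removed, 0 if none matched. */
int child_list_remove(child_list_t *list, int rank)
{
    int removed = 0;

    pthread_mutex_lock(&list->lock);
    for (child_t **link = &list->head; *link != NULL; link = &(*link)->next) {
        if ((*link)->rank == rank) {
            child_t *victim = *link;
            *link = victim->next;
            free(victim);
            removed = 1;
            break;
        }
    }
    pthread_mutex_unlock(&list->lock);
    return removed;
}

/* Count the children, also under the lock. */
int child_list_count(child_list_t *list)
{
    int n = 0;

    pthread_mutex_lock(&list->lock);
    for (child_t *c = list->head; c != NULL; c = c->next) {
        n++;
    }
    pthread_mutex_unlock(&list->lock);
    return n;
}
```

The point of the sketch is only that the removal path takes the same lock
as every other reader and writer of the list. Without it, one thread can
free a node that another thread is still traversing.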

I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
seen these tests fail under MTT until the epoch code was added, so I
suspect the epoch code might be at fault. Could someone familiar with
the epoch changes (Wesley) take a look at this failure?

Note that this fails intermittently, but for me it fails more often than
not. Attached is a log file of a run that succeeds followed by a failing
run. The messages of concern are those involving mca_oob_tcp_msg_recv
and the ones below it.

thanks,

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com
