I doubt that will solve the problem. The issue is that procs are continuing to fail while you are trying to respond to the first one. Here is what happens:
1. The first proc fails, causing a "connection failed" error that gets reported to the orted errmgr.
2. errmgr_orted starts sending "proc failed" notifications to the remaining procs.
3. The next proc fails before the rml.send from step 2 is issued. The OOB sees the failure and removes that connection from its hash table.
4. rml.send is issued, and fails because that connection is no longer in the OOB hash table.
So the inherent problem has nothing to do with maintaining coherence in the child list - it has to do with the disconnect between what is happening in the errmgr and the oob.
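To make the ordering concrete, here is a toy Python model of the four steps. It is only a sketch and not Open MPI code: ToyOOB, connection_lost, send, and notify_survivors are made-up names for illustration, and a plain dict stands in for the OOB hash table.

```python
# Toy model of the failure sequence. None of these names are real Open MPI APIs.


class ToyOOB:
    """Stands in for the OOB layer: a table of live connections keyed by proc id."""

    def __init__(self, procs):
        self.connections = {p: "connected" for p in procs}

    def connection_lost(self, proc):
        # The OOB sees the failure and drops the entry from its table.
        self.connections.pop(proc, None)

    def send(self, proc, msg):
        # A send to a peer that is no longer in the table fails.
        return proc in self.connections


def notify_survivors(oob, recipients, failed_proc):
    """Errmgr side: send a 'proc failed' notice to every recipient it knows about."""
    return {p: oob.send(p, ("proc failed", failed_proc)) for p in recipients}


oob = ToyOOB(procs=[1, 2, 3])

# Step 1: proc 1 fails; the OOB drops it and the errmgr is told.
oob.connection_lost(1)
# Step 2: the errmgr picks its recipients from its own view of who is alive.
recipients = [2, 3]
# Step 3: proc 2 fails before the sends go out; only the OOB table changes.
oob.connection_lost(2)
# Step 4: the send to proc 2 fails, the send to proc 3 succeeds.
results = notify_survivors(oob, recipients, failed_proc=1)
print(results)
```

In this model, locking the recipient list would not change the result, because the recipient list and the connection table are separate pieces of state that are updated at different times.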
On Aug 18, 2011, at 8:57 AM, Wesley Bland wrote:
> I just checked in a fix (I hope). I think the problem was that the errmgr
> was removing children from the list of odls children without using the
> mutex to prevent race conditions. Let me know if the MTT is still having
> problems tomorrow.
>> I am seeing the intel test suite tests MPI_Errhandler_fatal_c and
>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I have not
>> seen this test failing under MTT until the epoch code was added. So I
>> have a suspicion the epoch code might be at fault. Could someone
>> familiar with the epoch changes (Wesley) take a look at this failure?
>> Note this intermittently fails but fails for me more times than not.
>> Attached is a log file of a run that succeeds followed by the failing
>> run. The pieces of concern are the messages involving
>> mca_oob_tcp_msg_recv and below.
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle *- Performance Technologies*
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden]