After this change, I am getting a lot of errors of the form:
mca_btl_tcp_frag_recv: readv failed: Connection reset by
See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615
I have found this especially easy to reproduce if I run 16 processes all
with just the tcp and self btls on the same machine, running the
'hello_c' program in the examples directory.
Adrian Knoth wrote:
> As of r18169, I've changed the acceptance rules for incoming BTL-TCP
> connections.
> The old code would have denied a connection in case of non-matching
> addresses (comparison between source address and expected source
> address).
> Unfortunately, you cannot always say which source address an incoming
> packet will have (it's the sender's kernel that decides), so rejecting a
> connection due to "wrong" source address caused a complete hang.
> I had several cases, mostly multi-cluster setups, where this has happened
> all the time. (typical scenario: you're expecting the headnode's
> internal address, but since you're talking to another cluster,
> the kernel uses the headnode's external address)
> Though I've tested it as much as possible, I don't know if it breaks
> your setup, especially the multi-rail stuff. George?
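
To make the difference between the two acceptance rules concrete, here is a
minimal sketch in C. It is not the Open MPI code and not necessarily what
r18169 does: the helper names strict_match and peer_owns_address are
hypothetical, it covers IPv4 only, and it assumes the receiver already holds
a list of every address the peer is known to own. The strict rule compares
the incoming source address against one expected address; the relaxed rule
accepts any address from the peer's list.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an Open MPI function): the strict rule.
 * Accept only if the source address equals the single expected address. */
static bool strict_match(const struct in_addr *src,
                         const struct in_addr *expected)
{
    return src->s_addr == expected->s_addr;
}

/* Hypothetical helper (not an Open MPI function): a relaxed rule.
 * Accept if the source address is any one of the addresses the peer
 * is known to own, since the sender's kernel picks which one is used. */
static bool peer_owns_address(const struct in_addr *src,
                              const struct in_addr *known, size_t n_known)
{
    for (size_t i = 0; i < n_known; i++) {
        if (known[i].s_addr == src->s_addr) {
            return true;
        }
    }
    return false;
}
```

With a headnode that owns an internal address (say 10.0.0.1) and an external
one (say 192.0.2.10, both made up for illustration), a connection arriving
from the external address fails strict_match against the internal address
but passes peer_owns_address against the full list.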