Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Change in btl/tcp
From: Tim Prins (tprins_at_[hidden])
Date: 2008-04-18 08:04:17


Hi Adrian,

After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

This is especially easy to reproduce by running 16 processes on the same
machine with only the tcp and self BTLs enabled, using the 'hello_c'
program from the examples directory.

Tim

Adrian Knoth wrote:
> Hi!
>
> As of r18169, I've changed the acceptance rules for incoming BTL-TCP
> connections.
>
> The old code would have denied a connection in case of non-matching
> addresses (comparison between source address and expected source
> address).
>
> Unfortunately, you cannot always predict which source address an
> incoming packet will have (the sender's kernel decides), so rejecting a
> connection due to a "wrong" source address caused a complete hang.
>
> I had several cases, mostly multi-cluster setups, where this happened
> all the time. (Typical scenario: you're expecting the headnode's
> internal address, but since you're talking to another cluster,
> the kernel uses the headnode's external address.)
>
> Though I've tested it as much as possible, I don't know if it breaks
> your setup, especially the multi-rail stuff. George?
>
>
> Cheerio
>
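
For illustration, the acceptance rules described in the quoted message can
be sketched roughly as below. This is a minimal Python sketch, not the
actual btl/tcp code (which is C). The function names and the example
addresses are hypothetical. The new rule is assumed to accept a
connection even when the source address is not the expected one, on the
assumption that the peer is identified by some other means.

```python
import ipaddress
import socket


def kernel_chosen_source(dest_ip, dest_port=9):
    """Return the source address the local kernel would pick for dest_ip.

    Connecting a UDP socket sends no packet; it only makes the kernel
    select a route and therefore a source address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_ip, dest_port))
        return s.getsockname()[0]


def accept_old(peer_addr, expected_addrs):
    """Old rule as described: reject unless the source address matches."""
    peer = ipaddress.ip_address(peer_addr)
    return any(peer == ipaddress.ip_address(a) for a in expected_addrs)


def accept_new(peer_addr, expected_addrs):
    """Assumed new rule: a source address mismatch no longer rejects."""
    # Assumption: the peer is identified by other means, so the
    # address comparison is not used to deny the connection.
    return True


if __name__ == "__main__":
    expected = ["10.0.0.1"]   # hypothetical internal headnode address
    actual = "192.0.2.10"     # hypothetical external headnode address
    print("old rule accepts:", accept_old(actual, expected))
    print("new rule accepts:", accept_new(actual, expected))
```

In the multi-cluster scenario from the quoted message, the old rule
rejects the connection from the external address, while the assumed new
rule accepts it.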