Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Change in btl/tcp
From: Adrian Knoth (adi_at_[hidden])
Date: 2008-04-18 12:56:05


On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:

> Hi Adrian,

Hi!

> After this change, I am getting a lot of errors of the form:
> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by
> peer (104)
>
> See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

That's weird. I've tried hello_c.c on about ten machines with different
network configurations, and none of them showed any problems at all.

Do you have a very special setup? And if need be, would it be possible
to debug on your machine?

Of all the MTT sites, this error occurs only on Odin and Sif. What's so
special about these clusters?

> I have found this especially easy to reproduce if I run 16 processes all
> with just the tcp and self btls on the same machine, running the
> 'hello_c' program in the examples directory.

Unfortunately, I can't reproduce it that way. If this is related to the
change, then it would mean that mca_btl_tcp_proc_accept() returns false,
either after the large loop or in mca_btl_tcp_endpoint_accept().

Do you have the cycles to add some BTL_VERBOSE lines to see where things
go wrong? Or even to step through with a debugger?
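
A rough sketch of what such tracing could look like, with one message per
path that can fail. This is not the real btl/tcp code: TRACE is a stand-in
for BTL_VERBOSE so the sketch builds outside the Open MPI tree, and
sketch_proc_accept(), sketch_endpoint_accept(), the endpoint struct and the
integer "addresses" are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for BTL_VERBOSE (assumption: the real macro is not used here). */
#define TRACE(...)                                        \
    do {                                                  \
        fprintf(stderr, "[%s:%d] ", __func__, __LINE__);  \
        fprintf(stderr, __VA_ARGS__);                     \
        fputc('\n', stderr);                              \
    } while (0)

/* Hypothetical endpoint holding only what the sketch needs. */
struct sketch_endpoint {
    int  addr;  /* made-up peer address id */
    bool busy;  /* pretend this endpoint cannot take the connection */
};

/* The function discussed above returns a bool; an enum is used here only
 * so that the two failure paths can be told apart by the caller. */
enum sketch_result { ACCEPTED, NO_ADDR_MATCH, ENDPOINT_REFUSED };

/* Hypothetical counterpart of the per-endpoint accept step. */
static bool sketch_endpoint_accept(const struct sketch_endpoint *ep)
{
    if (ep->busy) {
        TRACE("endpoint with address %d refused the connection", ep->addr);
        return false;
    }
    return true;
}

/* Hypothetical counterpart of the per-process accept step: loop over the
 * endpoints, then hand the connection to the one whose address matches. */
static enum sketch_result
sketch_proc_accept(const struct sketch_endpoint *eps, size_t n, int peer_addr)
{
    assert(eps != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (eps[i].addr != peer_addr) {
            continue;
        }
        /* Failure path 1: the matching endpoint rejects the connection. */
        return sketch_endpoint_accept(&eps[i]) ? ACCEPTED : ENDPOINT_REFUSED;
    }

    /* Failure path 2: the loop finished without any address match. */
    TRACE("no endpoint matched peer address %d", peer_addr);
    return NO_ADDR_MATCH;
}
```

With a message on each path, the output alone shows whether the connection
was dropped after the loop or inside the endpoint-level accept.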

If you would rather I do it, shall I send you my SSH key?

Cheerio

-- 
mail: adi_at_[hidden]  	http://adi.thur.de	PGP/GPG: key via keyserver
Dying is only half as bad if you smoke KIM.