Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Change in btl/tcp
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-18 13:00:40


I'm seeing this problem as well, even when running just 4 processes on
a single node (though not as frequently as with higher process counts).
The trick is to force Open MPI to use only tcp,self and nothing else.
Did you try adding this (-mca btl tcp,self) to your runtime parameters?
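
For reference, a minimal sketch of the kind of run I mean. It assumes
hello_c from the examples directory has already been built and sits in
the current directory; the NP variable and the existence check are only
there for illustration.

```shell
# Restrict Open MPI to the tcp and self BTLs, as discussed in this thread.
# Assumption: ./hello_c was built from the examples directory.
NP=4   # the errors show up more often with higher counts, e.g. 16

if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_c ]; then
    mpirun -np "$NP" -mca btl tcp,self ./hello_c
else
    # Nothing to run on this machine; say so instead of failing.
    echo "mpirun or ./hello_c not found, skipping run" >&2
fi
```

If the readv errors are going to appear, they should show up on stderr
during such a run.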

-- Josh

On Apr 18, 2008, at 12:56 PM, Adrian Knoth wrote:

> On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:
>
>> Hi Adrian,
>
> Hi!
>
>> After this change, I am getting a lot of errors of the form:
>> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by
>> peer (104)
>>
>> See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615
>
> That's weird. I've tried hello_c.c on about ten machines with
> different network configurations, none of them showed any problems
> at all.
>
> Do you have a very special setup? And if need be, would it be possible
> to debug on your machine?
>
>
> From all MTT sites, this error only occurs on Odin and Sif. What's so
> special with these clusters?
>
>> I have found this especially easy to reproduce if I run 16 processes
>> all with just the tcp and self btls on the same machine, running the
>> 'hello_c' program in the examples directory.
>
> Unfortunately, I can't reproduce it that way. If this is related to the
> change, then it would mean that mca_btl_tcp_proc_accept() returns false,
> either after the large loop or in mca_btl_tcp_endpoint_accept().
>
> Do you have the cycles to add some BTL_VERBOSE-lines to see where things
> go wrong? Or even to step through with the debugger?
>
> If you want me to do it, I could provide you with my ssh key.
>
>
> Cheerio
>
>
> --
> mail: adi_at_[hidden] http://adi.thur.de PGP/GPG: key via keyserver
>
> Dying is only half as bad if you smoke KIM.