I also hit "readv failed: Connection timed out" in our production environment, and one of our engineers reproduced it on 20 nodes with gigabit ethernet, with one node's link throttled to 2 MB/s. The test is an MPI_Isend/MPI_Recv ring: each node posts an MPI_Isend of a large buffer to the next node and then calls MPI_Recv for data from the previous node, for many cycles. After a while we get the following error log: [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
This is Open MPI version 1.3.1, using the tcp BTL component.
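For reference, the ring test looks roughly like the sketch below (the buffer size, cycle count, and tag are illustrative, not the exact values we used):

```c
#include <mpi.h>
#include <stdlib.h>

#define MSG_BYTES (8 * 1024 * 1024)   /* "large" message; size is illustrative */
#define CYCLES    1000                /* many cycles, as in the reproduction */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;             /* send to the next node in the ring */
    int prev = (rank + size - 1) % size;      /* receive from the prior node       */
    char *sendbuf = malloc(MSG_BYTES);
    char *recvbuf = malloc(MSG_BYTES);

    for (int i = 0; i < CYCLES; i++) {
        MPI_Request req;
        /* Post the non-blocking send, then block in the receive. */
        MPI_Isend(sendbuf, MSG_BYTES, MPI_BYTE, next, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(recvbuf, MSG_BYTES, MPI_BYTE, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Run across the 20 nodes (e.g. `mpirun -np 20 --hostfile hosts ./ring`) with one node's link throttled, and the readv error appears after some number of cycles.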
My guess is that because the network fd is set non-blocking, the connect() can fail asynchronously: epoll_wait() is woken up by the error event but treats it as success and calls mca_btl_tcp_endpoint_recv_handler(), which does a non-blocking readv() on the fd whose connect failed. That readv() returns -1 with errno set to 110, which is ETIMEDOUT ("Connection timed out").
> From: firstname.lastname@example.org
> Date: Tue, 20 Apr 2010 09:24:17 -0400
> To: email@example.com
> Subject: Re: [OMPI users] 'readv failed: Connection timed out' issue
>
> On 2010-04-20, at 9:18AM, Terry Dontje wrote:
>
> > Hi Jonathan,
> >
> > Do you know what the top level function is or communication pattern? Is it some type of collective or a pattern that has a many to one.
>
> Ah, should have mentioned. The best-characterized code that we're seeing this with is an absolutely standard (logically) regular grid hydrodynamics code, only does nearest neighbour communication for exchanging guardcells; the Wait in this case is, I think, just a matter of overlapping communication with computation of the inner zones. There are things like allreduces in there, as well, for setting timesteps, but the communication pattern is overall extremely regular and well-behaved.
>
> > What might be happening is that since OMPI uses lazy connections by default, if all processes are trying to establish communications to the same process you might run into the below.
> >
> > You might want to see if setting "--mca mpi_preconnect_all 1" helps any. But beware this will cause your startup time to increase. However, this might give us insight as to whether the problem is flooding a single rank with connect requests.
>
> I'm certainly willing to try it.
>
> - Jonathan
>
> --
> Jonathan Dursi <firstname.lastname@example.org>