
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] 'readv failed: Connection timed out' issue
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-04-20 09:18:33

Hi Jonathan,

Do you know what the top-level function or communication pattern is? Is
it some type of collective, or a pattern with many-to-one communication?
Since OMPI uses lazy connections by default, if all processes try to
establish connections to the same process at once, you might run into
the errors below.
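The failure mode can be sketched outside of MPI with plain TCP sockets (a hypothetical standalone demo, not OMPI code): a listener with a small backlog that never accepts, plus several clients connecting simultaneously. Once the kernel's accept queue is full, further SYNs are dropped and those clients' connects time out, much like a rank flooded with connect requests it cannot service fast enough:

```python
import socket
import threading

# Server with a tiny accept queue that we never drain: connections
# beyond backlog+1 have their SYNs dropped by the kernel, so those
# clients' connect() calls eventually time out.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)                      # backlog of 1; accept() never called
port = server.getsockname()[1]

results = []

def client():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(0.5)                 # give up quickly instead of retrying SYNs
    try:
        s.connect(("127.0.0.1", port))
        results.append("connected")
    except socket.timeout:
        results.append("timed out")
    finally:
        s.close()

threads = [threading.Thread(target=client) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count("connected"), "connected;",
      results.count("timed out"), "timed out")
```

The exact split depends on the OS's backlog handling, but with 8 simultaneous clients only a couple complete the handshake and the rest time out, which is the same shape as the readv/connect timeouts reported here.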

You might want to see if setting "--mca mpi_preconnect_all 1" helps.
Be aware that this will increase your startup time. However, it might
give us insight into whether the problem is flooding a single rank with
connect requests.
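For example (the process count and executable name here are placeholders for your actual job):

```
mpirun --mca mpi_preconnect_all 1 -np 512 ./your_app
```

This forces all pairwise connections to be established during MPI_Init, rather than lazily at first communication, so any connection storm happens up front instead of mid-run.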


Jonathan Dursi wrote:
> Hi:
> We've got OpenMPI 1.4.1 and Intel MPI running on our 3000 node system. We like OpenMPI for large jobs, because the startup time is much faster (and startup is more reliable) than the current defaults with IntelMPI; but we're having some pretty serious problems when the jobs are actually running. When running medium- to large- sized jobs (say, anything over 500 cores) over ethernet using OpenMPI, several of our users, using a variety of very different sorts of codes, report errors like this:
> [gpc-f102n010][[30331,1],212][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> which sometimes hang the job, or sometimes kill it outright:
> [gpc-f114n073][[23186,1],109][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> [gpc-f114n075][[23186,1],125][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mpirun: killing job...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 9513 on node gpc-f123n025
> exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> We don't see this problem when the same users, using the same codes, use IntelMPI.
> Unfortunately, this only happens intermittently, and only with large jobs, so it is hard to track down. It seems to happen more reliably with larger numbers of processors, but I don't know if that tells us something real about the issue, or just that larger N -> better statistics. For one user's code, it definitely occurs during an MPI_Wait (this particular code has been run on a wide variety of machines with a wide variety of MPIs -- which isn't proof of correctness of course, but everything looks fine); for others it is less clear. I don't know if it's an OpenMPI issue, or just represents a network issue which Intel's MPI happens to be more tolerant of with the default set of parameters. It's also unclear whether or not this issue occurred with earlier OpenMPI versions.
> Where should I start looking to find out what is going on? Are there parameters that can be adjusted to play with timeouts to see if the issue can be localized, or worked around?
> - Jonathan

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>