Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Help tracing cause of readv errors
From: Atle Rudshaug (atle_at_[hidden])
Date: 2009-11-25 06:36:01


Pacey, Mike wrote:
> One of my users recently reported random hangs in his OpenMPI application.
> I've run some tests using multiple 2-node 16-core runs of the IMB
> benchmark and can occasionally replicate the problem. Looking through
> the mail archive, a previous occurrence of this error seems to have been
> down to suspect code, but as it's IMB failing here, I suspect the problem
> lies elsewhere. The full set of errors generated by a failed run is:
>
> [lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],10][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>
> I'm used to OpenMPI terminating cleanly, but that's not happening in
> this case. All the OpenMPI processes on one node terminate, while the
> processes on the other simply spin with 100% CPU utilisation. I've run
> this 2-node test a number of times, and I'm not seeing any pattern (i.e.,
> I can't pin it down to a single node - a subsequent run using the two
> nodes involved above ran fine).
>
> Can anyone provide any pointers in tracking down this problem? System
> details as follows:
>
> - OpenMPI 1.3.3, compiled with gcc version 4.1.2 20080704 (Red Hat
> 4.1.2-44), using only the --prefix and --with-sge options.
> - OS is Scientific Linux SL release 5.3
> - CPUs are 2.3GHz Opteron 2356
>
> Regards,
> Mike.
>
> -----
>
> Dr Mike Pacey, Email: M.Pacey_at_[hidden]
> High Performance Systems Support, Phone: 01524 593543
> Information Systems Services, Fax: 01524 594459
> Lancaster University,
> Lancaster LA1 4YW
>
I got a similar error when using non-blocking communication on large
datasets. I could not figure out why it was happening, since it seemed
more or less random. I eventually worked around the problem by switching
to blocking communication (roughly the change sketched below), which felt
kind of sad... If anyone knows whether this is a bug in OpenMPI or
somehow connected to hardware, please share.
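
For what it's worth, the change was roughly along the lines of the sketch
below. This is a trimmed-down illustration rather than my actual code: the
datatype, the single peer rank and the function names are all made up.

/* Pattern that was failing for me: a non-blocking exchange with a peer,
   completed with a wait. */
#include <mpi.h>

void exchange_nonblocking(double *sendbuf, double *recvbuf,
                          int count, int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);
    /* ...computation could overlap with the communication here... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

/* The blocking workaround I switched to. MPI_Sendrecv pairs the send
   and the receive internally, so it avoids the deadlock a naive
   MPI_Send/MPI_Recv pair between two ranks can cause. */
void exchange_blocking(double *sendbuf, double *recvbuf,
                       int count, int peer, MPI_Comm comm)
{
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, peer, 0,
                 recvbuf, count, MPI_DOUBLE, peer, 0,
                 comm, MPI_STATUS_IGNORE);
}

Mike, it might also be worth re-running IMB with the BTL verbosity turned
up (e.g. "mpirun --mca btl_base_verbose 100 ...") to see where the TCP
connection actually drops - though I haven't tried that myself, so treat
it as a guess.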

- Atle