Hi! I'm using Open MPI 1.3 on 30 nodes connected with Gigabit Ethernet, running Red Hat Linux x86_64.
Our MPI job sometimes hangs and shows the following error in the logs:
[btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
I ran a test like this: I wrote a hello world program that sends "helloworld" from rank 0 to rank 1, then modified mca_btl_tcp_frag_recv() in btl_tcp_frag.c so that the readv() return value, cnt, is forced to -1. After rebuilding Open MPI and replacing the dynamic libraries, I ran the hello world program and the MPI job hung in MPI_Recv().
I have the following questions: Does Open MPI support detecting network errors in the TCP BTL, such as a failed readv or writev? I found that mca_btl_tcp_endpoint_recv_handler() in the BTL layer cannot return the error state to the PML. How could I make it do so?
How can MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv detect these errors and avoid hanging?
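
To make the question more concrete, here is a rough sketch in plain C of the behaviour I am hoping for. It is not Open MPI code: frag_recv, error_cb_t, upper_state_t and upper_on_error are hypothetical names made up for illustration, and I am only assuming that something similar could be wired between the BTL and the PML. The idea is that when readv fails with a real error, the receive routine notifies the upper layer through a callback instead of returning silently, so a blocked receive could be completed with an error.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical upper-layer ("PML-like") state: remembers a transport error. */
typedef struct {
    int failed;       /* nonzero once the transport has reported an error */
    int saved_errno;  /* errno value delivered with that error */
} upper_state_t;

/* Hypothetical callback type the upper layer registers with the transport. */
typedef void (*error_cb_t)(void *ctx, int err);

/* Hypothetical upper-layer callback: record the error so that a pending
 * receive could be failed instead of waiting forever. */
void upper_on_error(void *ctx, int err)
{
    upper_state_t *st = (upper_state_t *)ctx;
    st->failed = 1;
    st->saved_errno = err;
}

/* Hypothetical receive routine, loosely modelled on the readv call in the
 * log above. Returns the number of bytes read, 0 on end of file, or -1 on
 * failure. On a real error (not EINTR/EAGAIN) it calls cb before returning. */
ssize_t frag_recv(int fd, void *buf, size_t len, error_cb_t cb, void *ctx)
{
    struct iovec iov;
    ssize_t cnt;

    iov.iov_base = buf;
    iov.iov_len = len;

    /* Retry if the call was interrupted by a signal. */
    do {
        cnt = readv(fd, &iov, 1);
    } while (cnt < 0 && errno == EINTR);

    if (cnt < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
        int err = errno;      /* save errno, the callback may change it */
        if (cb != NULL) {
            cb(ctx, err);     /* propagate the failure upward */
        }
        errno = err;
    }
    return cnt;
}
```

With something like this, calling frag_recv on a broken descriptor leaves failed set in the upper-layer state, while a successful read leaves it untouched. What I do not know is whether the PML has a hook where such a callback could be registered.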