Hi! 
  I'm using Open MPI 1.3 on 30 nodes connected by Gigabit Ethernet, running Red Hat Linux on x86_64.

 
Our MPI jobs sometimes hang and show the following error in the logs:
 
 [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv  failed: Connection timed out (110)
 
To reproduce this, I ran the following test. I wrote a hello world program that sends "helloworld" from rank 0 to rank 1. Then I modified mca_btl_tcp_frag_recv() in btl_tcp_frag.c to force the return value of readv (cnt) to -1, rebuilt Open MPI, and replaced the dynamic libraries. When I run the hello world program against the modified libraries, the MPI job hangs in MPI_Recv().
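
To make the injected failure concrete, here is a minimal standalone sketch of the idea in plain POSIX C. It is not the Open MPI code: test_readv(), inject_readv_failure, and classify_recv_errno() are hypothetical names made up for illustration. The wrapper fakes the same errno that appears in the log above, and the classifier shows the distinction I would expect the receive path to make between "try again later" and "the connection is broken".

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical switch: when non-zero, the wrapper below fakes a failure. */
int inject_readv_failure = 0;

/* Hypothetical wrapper around readv(); not a function from Open MPI. */
ssize_t test_readv(int fd, const struct iovec *iov, int iovcnt)
{
    if (inject_readv_failure) {
        /* Same error as in the log: "Connection timed out". */
        errno = ETIMEDOUT;
        return -1;
    }
    return readv(fd, iov, iovcnt);
}

/* Hypothetical classification of the errno left by a failed read. */
enum recv_status { RECV_RETRY, RECV_FATAL };

enum recv_status classify_recv_errno(int err)
{
    switch (err) {
    case EINTR:   /* interrupted by a signal: safe to retry */
    case EAGAIN:  /* non-blocking socket has no data yet: retry later */
        return RECV_RETRY;
    default:      /* ETIMEDOUT, ECONNRESET, ...: the connection is unusable */
        return RECV_FATAL;
    }
}
```

With inject_readv_failure set to 1, test_readv() returns -1 with errno set to ETIMEDOUT, which the classifier treats as fatal. In my test, the equivalent fatal case inside the TCP BTL is what leaves MPI_Recv() hanging.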
 
I have the following questions:
 
Does Open MPI support detecting network errors in the TCP BTL, such as a failed readv or writev? I found that mca_btl_tcp_endpoint_recv_handler() in the BTL layer cannot return the error status to the PML. How could I make it do so?
 
How can MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv detect such errors and avoid hanging?
 
 
Thanks a lot!
 


