
Open MPI User's Mailing List Archives


Subject: [OMPI users] MPI_Recv hang because readv failed at mca_btl_tcp_frag_recv()
From: Guanyinzhu (buptzhugy_at_[hidden])
Date: 2010-05-05 06:43:02


Hi!
  I'm using Open MPI 1.3 on 30 nodes running Red Hat Linux x86_64, connected by Gigabit Ethernet.
 
Our MPI job sometimes hangs and shows the following error log:

 

 [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

 

I ran a test like this: I wrote a hello world program that sends "helloworld" from rank 0 to rank 1, then modified mca_btl_tcp_frag_recv() in btl_tcp_frag.c to force the readv return value (cnt) to -1. After rebuilding Open MPI and replacing the dynamic libraries, I ran the hello world program, and the MPI job hung at MPI_Recv().
 
I have the following questions:
         
     Does Open MPI detect network errors in the TCP BTL, such as a failed readv or writev? I found that mca_btl_tcp_endpoint_recv_handler() in the BTL layer cannot return the error state to the PML. How could I make it do so?

 

How can MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv detect these errors and avoid hanging?
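
To make the first question concrete, here is a minimal sketch in plain POSIX C of the kind of error reporting I mean. It is not Open MPI code: the names frag_recv, frag_recv_status, frag_error_cb and record_errno are all hypothetical, and the callback only stands in for whatever mechanism the BTL could use to notify the PML. The idea is that a failed readv is classified as either transient (retry later) or fatal (for example ETIMEDOUT, errno 110 on Linux), and fatal failures are passed upward instead of only being logged.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical outcome codes; these names are not part of Open MPI. */
typedef enum {
    FRAG_RECV_OK = 0,   /* at least one byte was read */
    FRAG_RECV_RETRY,    /* transient condition, try again later */
    FRAG_RECV_CLOSED,   /* peer closed the connection (readv returned 0) */
    FRAG_RECV_FAILED    /* hard error such as ETIMEDOUT or ECONNRESET */
} frag_recv_status;

/* Hypothetical callback type: the upper layer registers one of these
 * so that it learns about a dead connection instead of waiting forever. */
typedef void (*frag_error_cb)(int fd, int saved_errno, void *ctx);

/* Example callback: stores the errno value in the int that ctx points to. */
void record_errno(int fd, int saved_errno, void *ctx)
{
    (void)fd;
    *(int *)ctx = saved_errno;
}

/* Read into iov and classify the result. On a closed connection or a
 * hard error, on_error (if not NULL) is called before returning. */
frag_recv_status frag_recv(int fd, struct iovec *iov, int iovcnt,
                           size_t *nread, frag_error_cb on_error, void *ctx)
{
    ssize_t cnt = readv(fd, iov, iovcnt);

    if (cnt > 0) {
        *nread = (size_t)cnt;
        return FRAG_RECV_OK;
    }

    *nread = 0;

    if (cnt == 0) {
        /* Orderly shutdown by the peer; reported with errno value 0. */
        if (on_error != NULL) {
            on_error(fd, 0, ctx);
        }
        return FRAG_RECV_CLOSED;
    }

    /* cnt == -1: save errno before calling anything else. */
    int err = errno;
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) {
        return FRAG_RECV_RETRY;
    }

    if (on_error != NULL) {
        on_error(fd, err, ctx);
    }
    return FRAG_RECV_FAILED;
}
```

In my test above, forcing cnt to -1 would correspond to the FRAG_RECV_FAILED path here. What I am asking is whether the BTL has an equivalent path that lets the PML complete the pending request with an error instead of leaving MPI_Recv blocked.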
 
 
Thanks a lot!
 

                                               