I am getting the following error with openmpi-1.1b1:
mca_btl_tcp_frag_send: writev failed with errno=110
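For anyone decoding the number: errno values are platform-specific, so the small sketch below looks the value up on whatever machine runs it. On Linux, 110 is ETIMEDOUT ("Connection timed out"). That the nodes run Linux is an assumption here, and the helper name describe_errno is made up for illustration.

```python
import errno
import os

ERRNO_FROM_LOG = 110  # value reported by mca_btl_tcp_frag_send in the error above


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "ETIMEDOUT";
    # values unknown to this platform fall back to "UNKNOWN".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_errno(ERRNO_FROM_LOG)
    print(f"errno={ERRNO_FROM_LOG}: {name} ({message})")
```

Run this on one of the compute nodes rather than a workstation, since the mapping can differ between operating systems.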
1) This never happens with the other MPI implementations I have tried, such as MPICH and LAM.
2) It only seems to happen with large numbers of CPUs, 32 and occasionally
16, and with larger message sizes. In this case it was 128K.
3) It only seems to happen with dual cpus on each node.
4) My configuration is default with (in openmpi-mca-params.conf):
pls_rsh_agent = rsh
btl = tcp,self
btl_tcp_if_include = eth1
I also set --mca btl_tcp_eager_limit 131072 when running the program, though
leaving this out does not eliminate the problem.
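To make the configuration easy to reproduce, here is a rough sketch that turns the parameters above into the equivalent mpirun command line using the same --mca key/value form. The program name ./comm_test, the process count of 32, and the helper mpirun_command are placeholders for illustration, not my actual invocation. Note that 131072 bytes is exactly 128K, the message size at which the failure showed up.

```python
import shlex

# MCA parameters quoted from this post (config file plus the command-line one).
MCA_PARAMS = {
    "pls_rsh_agent": "rsh",
    "btl": "tcp,self",
    "btl_tcp_if_include": "eth1",
    "btl_tcp_eager_limit": str(128 * 1024),  # 131072 bytes
}


def mpirun_command(params, nprocs, program):
    """Build an argv list passing each parameter as '--mca <key> <value>'."""
    argv = ["mpirun", "-np", str(nprocs)]
    for key, value in params.items():
        argv += ["--mca", key, value]
    argv.append(program)
    return argv


if __name__ == "__main__":
    # "./comm_test" and 32 processes are placeholders, not the real job.
    print(shlex.join(mpirun_command(MCA_PARAMS, 32, "./comm_test")))
```

Dropping the btl_tcp_eager_limit entry from the dictionary gives the variant of the run without the eager-limit override, which fails the same way.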
My program is a communication test; it sends bidirectional point-to-point
messages among N CPUs. In one test it exchanges messages between pairs of
CPUs, in another it reads from the node on its left and sends to the node on
its right (a kind of ring), and in a third it uses MPI_ALLREDUCE.
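To make the three patterns concrete, here is a minimal sketch of the rank arithmetic only. It is not the test program itself and makes no MPI calls. The pairing of ranks as (0,1), (2,3), ... and the periodic wrap-around of the ring are assumptions for illustration, as are all the function names.

```python
def pair_partner(rank, size):
    """Partner for the pairwise exchange, assuming pairs (0,1), (2,3), ..."""
    partner = rank ^ 1  # flip the lowest bit: 0<->1, 2<->3, ...
    # With an odd number of ranks the last one has no partner.
    return partner if partner < size else None


def ring_neighbors(rank, size):
    """(left, right) neighbours in a periodic ring.

    Each rank receives from its left neighbour and sends to its right one.
    """
    return (rank - 1) % size, (rank + 1) % size


def allreduce_sum(values):
    """Serial model of an all-reduce with a sum: every rank gets the total."""
    total = sum(values)
    return [total] * len(values)


if __name__ == "__main__":
    size = 4
    for rank in range(size):
        print(rank, pair_partner(rank, size), ring_neighbors(rank, size))
```

In the pairwise and ring tests every rank sends and receives at the same time, so all the TCP connections on a node are active at once.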
Finally, the TCP driver in Open MPI does not seem nearly as good as the one in
LAM. With LAM I got higher throughput and far fewer dropouts.
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005