
Open MPI User's Mailing List Archives


From: Yuan Wan (ywan_at_[hidden])
Date: 2007-06-26 10:34:18


Hi all,

I'm benchmarking our new cluster with HPL. I picked Open MPI as the parallel
environment because, in low-level benchmarks, I found it can take advantage
of the two gigabit Ethernet TCP networks on our cluster (bandwidth can reach
up to 250 MB/s).

The HPL code builds fine and runs well for small problem sizes.
However, when I run it on 32 nodes (128-way), it crashes partway through
with the following error message:

---------------------------------------------
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node089:29190] mca_btl_tcp_frag_send: writev failed with errno=104
[node090:27881] mca_btl_tcp_frag_send: writev failed with errno=104
[node072:02729] mca_btl_tcp_frag_send: writev failed with errno=104
[node071:03029] mca_btl_tcp_frag_send: writev failed with errno=104
.....
[node084:06044] mca_btl_tcp_frag_send: writev failed with errno=104
[node086:01346] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node100:23294] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node085:04347] mca_btl_tcp_frag_send: writev failed with errno=104
[node087:31391] mca_btl_tcp_frag_send: writev failed with errno=104
---------------------------------------------
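For reference, on Linux errno 104 is ECONNRESET ("Connection reset by peer"), i.e. the remote end of the TCP connection went away while writev was sending. The following is a minimal sketch, not part of the original report, that tallies such failures per node from a saved copy of the output and decodes the errno. The helper names (summarize, describe_errno) and the SAMPLE text are illustrative assumptions; errno numbering is platform-specific, so the decoding should be run on the same kind of system that produced the log.

```python
import errno
import os
import re
from collections import Counter

# Matches lines such as:
#   [node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
LINE_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+\S+:\s+"
    r"writev failed with errno=(?P<code>\d+)"
)

# A few lines copied from the error output above (illustrative sample only).
SAMPLE = """\
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
"""


def summarize(log_text):
    """Return (failures per host, occurrences per errno code)."""
    per_host = Counter()
    codes = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            per_host[match.group("host")] += 1
            codes[int(match.group("code"))] += 1
    return per_host, codes


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    per_host, codes = summarize(SAMPLE)
    for host, count in sorted(per_host.items()):
        print(f"{host}: {count} failed writev call(s)")
    for code in sorted(codes):
        name, message = describe_errno(code)
        print(f"errno={code}: {name} ({message})")
```

If the failures cluster on a few hosts, those nodes (or the links to them) are a reasonable place to start looking; if they are spread evenly, the cause is more likely in the overall network configuration.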

Following the FAQ entry below, I explicitly specified the interface names
of the two TCP networks, but the code still breaks:

mpirun --mca btl_tcp_if_include eth0,eth1 -np 128 -bynode -machinefile hostfile ./xhpl

http://icl.cs.utk.edu/open-mpi/faq/?category=tcp#tcp-selection

If I include only one TCP network, the code doesn't break, but the
performance is not desirable.
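To compare the single-network and dual-network runs side by side, the two command lines can be generated from one template. This is only a sketch: it reuses the flags from the mpirun command quoted above, and the function name mpirun_command and its defaults are assumptions made for illustration.

```python
import shlex


def mpirun_command(interfaces, nprocs=128, hostfile="hostfile", binary="./xhpl"):
    """Build the mpirun argument list for the given TCP interfaces.

    Only the flags that appear in the command quoted in this message are used.
    """
    return [
        "mpirun",
        "--mca", "btl_tcp_if_include", ",".join(interfaces),
        "-np", str(nprocs),
        "-bynode",
        "-machinefile", hostfile,
        binary,
    ]


if __name__ == "__main__":
    # One interface (reported to run to completion) vs. both (reported to crash).
    for interfaces in (["eth0"], ["eth0", "eth1"]):
        print(shlex.join(mpirun_command(interfaces)))
```

Keeping everything except the interface list identical makes it easier to attribute any difference in behaviour to the second network alone.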

Does anyone know how to fix it?

--Yuan

Yuan Wan
---
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: ywan_at_[hidden]

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ