Dear Open MPI developers,
I'm using Open MPI 1.2.2 over OFED 1.2 on a 256-node Linux cluster
(dual Opteron, dual core) with an InfiniBand 4x interconnect.
Each cluster node is equipped with 4 (or more) network interfaces,
namely 2 gigabit Ethernet ones plus 2 IPoIB. The two gigabit interfaces
are named eth0 and eth1, while the two IPoIB ones are named ib0 and ib1.
eth0 is a management network with poor performance, and we also do not
want the ib* interfaces to carry Open MPI's OOB or TCP traffic, so we
would like eth1 to be used for Open MPI OOB and TCP.
In order to drive the OOB over only eth1 I've tried various combinations
of oob_tcp_[ex|in]clude MCA statements, starting from the obvious
oob_tcp_exclude = lo,eth0,ib0,ib1
then trying the other obvious one:
oob_tcp_include = eth1
and both at the same time.
Next I've tried the following:
oob_tcp_exclude = eth0
but after the job starts, I still see a lot of TCP connections
established over eth0, ib0 or ib1.
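For anyone trying to reproduce this, one way to see which interface each connection uses is to group the ESTABLISHED rows of `netstat -tn` by the subnet of their local address. The Python sketch below only illustrates that idea and is not necessarily how the observation above was made. The subnets and the sample output are hypothetical, and `count_by_interface` is a name made up for this example.

```python
import ipaddress
from collections import Counter

# Hypothetical subnets, one per interface; substitute the real ones.
HYPOTHETICAL_NETWORKS = {
    "eth0": "10.0.0.0/24",
    "eth1": "10.1.0.0/24",
    "ib0": "10.2.0.0/24",
    "ib1": "10.3.0.0/24",
}

# Made-up sample in the column layout printed by `netstat -tn`.
SAMPLE_NETSTAT = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.1.0.14:45012         10.1.0.12:38001         ESTABLISHED
tcp        0      0 10.0.0.14:45013         10.0.0.12:38002         ESTABLISHED
tcp        0      0 10.2.0.14:45014         10.2.0.12:38003         ESTABLISHED
tcp        0      0 10.1.0.14:45015         10.1.0.20:38004         TIME_WAIT
"""


def count_by_interface(netstat_lines, networks):
    """Count ESTABLISHED IPv4 TCP connections per interface, by local address."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Keep only IPv4 TCP rows in state ESTABLISHED; this skips headers too.
        if len(fields) < 6 or fields[0] != "tcp" or fields[5] != "ESTABLISHED":
            continue
        local_ip = ipaddress.ip_address(fields[3].rsplit(":", 1)[0])
        # First subnet containing the local address, else "unknown".
        name = next((n for n, net in nets.items() if local_ip in net), "unknown")
        counts[name] += 1
    return dict(counts)


if __name__ == "__main__":
    print(count_by_interface(SAMPLE_NETSTAT.splitlines(), HYPOTHETICAL_NETWORKS))
```

On a real node the sample text would be replaced by the actual `netstat -tn` output captured while the job is running, and the subnets by those configured on eth0, eth1, ib0 and ib1.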
Furthermore, the following error occurs:
[node191:03976] [0,1,14]-[0,1,12] mca_oob_tcp_peer_complete_connect:
connection failed: Connection timed out (110) - retrying
The only way I've found to have the TCP connections bound only to
the eth1 interface is to use both of the following MCA directives on the
command line:
mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include lo,eth0,ib0,ib1 .....
This looks like a bug to me.
Is anyone able to reproduce this behaviour?
If this is a bug, are there fixes?
Thanks.
Marco
--
-----------------------------------------------------------------
Marco Sbrighi m.sbrighi_at_[hidden]
HPC Group
CINECA Interuniversity Computing Centre
via Magnanelli, 6/3
40033 Casalecchio di Reno (Bo) ITALY
tel. 051 6171516