Open MPI User's Mailing List Archives

Subject: [OMPI users] Bug in oob_tcp_[in|ex]clude?
From: Marco Sbrighi (m.sbrighi_at_[hidden])
Date: 2007-12-17 08:35:27


Dear Open MPI developers,

I'm using Open MPI 1.2.2 over OFED 1.2 on a 256-node, dual-Opteron,
dual-core Linux cluster with an InfiniBand 4x interconnect.

Each cluster node is equipped with 4 (or more) Ethernet interfaces,
namely 2 gigabit ones plus 2 IPoIB. The two gigabit interfaces are named
eth0 and eth1, while the two IPoIB interfaces are named ib0 and ib1.

eth0 is a management network with poor performance, and furthermore we
don't want the ib* interfaces to carry MPI's traffic (neither OOB nor
TCP), so we would like eth1 to be used for Open MPI OOB and TCP.

In order to drive the OOB over only eth1 I've tried various combinations
of oob_tcp_[ex|in]clude MCA statements, starting from the obvious
 
oob_tcp_exclude = lo,eth0,ib0,ib1

then trying the other obvious one:

oob_tcp_include = eth1

and both at the same time.
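
The result expected from either setting can be sketched as a simple filter over the node's interface names. This is only an illustration of the intended semantics, not Open MPI's actual selection code; the function name `select_interfaces` is hypothetical.

```python
def select_interfaces(available, include=None, exclude=None):
    """Return the interfaces expected to remain after include/exclude filtering.

    Hypothetical helper: it models the intent of oob_tcp_include and
    oob_tcp_exclude, not how Open MPI implements them.
    """
    selected = list(available)
    if include:
        # An include list keeps only the named interfaces.
        selected = [name for name in selected if name in include]
    if exclude:
        # An exclude list drops the named interfaces.
        selected = [name for name in selected if name not in exclude]
    return selected


# Interface names taken from the cluster description above.
interfaces = ["lo", "eth0", "eth1", "ib0", "ib1"]

print(select_interfaces(interfaces, exclude=["lo", "eth0", "ib0", "ib1"]))
print(select_interfaces(interfaces, include=["eth1"]))
```

Under these assumed semantics both calls leave only eth1, which is the behaviour the two settings were expected to produce.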

Next I've tried the following:

oob_tcp_exclude = eth0

but after the job starts, I still have a lot of TCP connections
established using eth0, ib0 or ib1.
Furthermore, the following error occurs:

   [node191:03976] [0,1,14]-[0,1,12] mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying
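
One way to check which interface the established connections actually use is to map each connection's local address to the subnet of an interface. The sketch below assumes the local addresses have already been collected (for example, copied from the output of `netstat -tn`). The subnets and addresses are placeholders, not the real cluster addressing, and the function names are hypothetical.

```python
import ipaddress


def interface_for(local_ip, networks):
    """Return the name of the interface whose subnet contains local_ip, or None."""
    addr = ipaddress.ip_address(local_ip)
    for name, cidr in networks.items():
        if addr in ipaddress.ip_network(cidr):
            return name
    return None


def count_by_interface(local_ips, networks):
    """Count established connections per interface, keyed by interface name."""
    counts = {}
    for ip in local_ips:
        name = interface_for(ip, networks)
        counts[name] = counts.get(name, 0) + 1
    return counts


# Placeholder subnets: replace with the real addressing of each interface.
networks = {
    "eth0": "10.0.0.0/24",
    "eth1": "10.0.1.0/24",
    "ib0": "10.1.0.0/24",
    "ib1": "10.1.1.0/24",
}

# Placeholder local addresses of established TCP connections.
local_ips = ["10.0.1.5", "10.0.0.5", "10.1.0.5", "10.0.1.6"]

print(count_by_interface(local_ips, networks))
```

If the include/exclude settings were honoured, every connection would be counted under eth1 only.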

The only way I've found to have the TCP connections bound only to the
eth1 interface is to use both of the following MCA directives on the
command line:

mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include lo,eth0,ib0,ib1 .....

This looks like a bug to me.

Is anyone able to reproduce this behaviour?
If this is a bug, is there a fix?

Thanks.

Marco
 

-- 
-----------------------------------------------------------------
 Marco Sbrighi  m.sbrighi_at_[hidden]
 HPC Group
 CINECA Interuniversity Computing Centre
 via Magnanelli, 6/3
 40033 Casalecchio di Reno (Bo) ITALY
 tel. 051 6171516