Open MPI User's Mailing List Archives

Subject: [OMPI users] Subnet routing (1.2.x) not working in 1.4.3 anymore
From: Mirco Wahab (mirco.wahab_at_[hidden])
Date: 2011-10-25 16:15:12

For the last few years, it has been very simple to
set up multiple high-performance (GbE) back-to-back
connections between three nodes (triangular topology)
or four nodes (tetrahedral topology).

All you had to do was
- use 3 (or 4) cheap compute nodes w/Linux and connect
   each of them via a standard GbE router (onboard GbE NIC)
   to a file server,
- put 2 (triangular topol.) or 3 (tetrahedral topol.)
   $25 PCIe GbE NICs into *each* node,
- connect the nodes with 3 (triangular) or 6 (tetrahedral)
   short crossover Cat5e cables,
- configure the extra NICs into different subnets
   according to their "edge index", e.g.
   for 3 nodes (node10, node11, node12)
       node10:
         onboard NIC: on eth0 (to router/server)
         extra NIC: on eth1 (edge 1 to
         extra NIC: on eth2 (edge 2 to
       node11:
         onboard NIC: on eth0 (to router/server)
         extra NIC: on eth1 (edge 1 to
         extra NIC: on eth3 (edge 3 to
       node12:
         onboard NIC: on eth0 (to router/server)
         extra NIC: on eth2 (edge 2 to
         extra NIC: on eth3 (edge 3 to
- that's it. I mean, that *was* it, with 1.2.x.
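
The one-subnet-per-edge scheme can be sketched as a small Python
script. This is only an illustration: the 192.168.N.0/24 numbering and
the plan_edges helper are assumptions, since the original addresses are
not preserved in this post. It enumerates every node pair, gives each
pair its own subnet, and names the interface after the edge index.

```python
import ipaddress
from itertools import combinations


def plan_edges(nodes, base="192.168.0.0/16"):
    """Give every node pair (edge) its own /24 subnet.

    The addressing is hypothetical; only the edge numbering follows the post.
    """
    subnets = ipaddress.ip_network(base).subnets(new_prefix=24)
    next(subnets)  # skip the first /24, assumed here to be the eth0/router network
    plan = []
    for edge, (a, b) in enumerate(combinations(nodes, 2), start=1):
        net = next(subnets)
        hosts = iter(net.hosts())
        plan.append({
            "edge": edge,
            "iface": f"eth{edge}",  # interface named after the edge index
            "subnet": str(net),
            "addrs": {a: str(next(hosts)), b: str(next(hosts))},
        })
    return plan


if __name__ == "__main__":
    for entry in plan_edges(["node10", "node11", "node12"]):
        print(entry)
```

With four node names the same function returns six edges, and each node
appears in three of them, i.e. three extra NICs per node.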

OMPI 1.2.x would then ingeniously discover the routable edges
and open communication ports accordingly, without any additional
explicit host routing, e.g. when invoked by

$> mpirun -np 12 --host c10,c11,c12 --mca btl_tcp_if_exclude lo,eth0 my_mpi_app

and (measured by iftop) saturate the available edges with
about 100 MB/s duplex on each of them. It did not stumble
over the fact that some interfaces are not directly reachable
from every NIC. And this was very convenient over the years.

With 1.4.3 (which ships out of the box with current Linux distributions),
this no longer works. It hangs and, after a timeout, complains about failed
endpoint connects, e.g.:

[node12][[52378,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to failed: Connection timed out (110)
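
A timeout like this is what one would expect if a node tries to connect
to a peer address on a subnet it has no NIC in (the target address is
not preserved in the log line, so this is a guess). Below is a minimal
Python sketch of the subnet matching that would avoid it. It is not
Open MPI's code; the addresses and the reachable_pairs helper are
hypothetical.

```python
import ipaddress


def reachable_pairs(local_ifaces, peer_addrs):
    """Return (local_iface, peer_addr) pairs that share a directly attached subnet."""
    pairs = []
    for name, cidr in local_ifaces.items():
        net = ipaddress.ip_interface(cidr).network
        for addr in peer_addrs:
            if ipaddress.ip_address(addr) in net:
                pairs.append((name, addr))
    return pairs


# Hypothetical addressing: node12 sits on edges 2 and 3, node10 on edges 1 and 2.
node12_ifaces = {"eth2": "192.168.2.2/24", "eth3": "192.168.3.2/24"}
node10_addrs = ["192.168.1.1", "192.168.2.1"]

if __name__ == "__main__":
    print(reachable_pairs(node12_ifaces, node10_addrs))
```

Only the edge-2 address of node10 matches a subnet attached to node12.
A connect() to node10's edge-1 address from node12 would have no route
in this setup unless explicit host routes were added.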

* Can the intelligent behaviour of 1.2.x be "configured back"?

* What should the topology look like to work with 1.4.x painlessly?

Thanks & regards