Open MPI User's Mailing List Archives

Subject: [OMPI users] TCP btl misbehaves if btl_tcp_port_min_v4 is not set.
From: Eric Thibodeau (kyron_at_[hidden])
Date: 2009-07-23 14:48:34


Hello all,

   (this _might_ be related to https://svn.open-mpi.org/trac/ompi/ticket/1505)

   I just compiled and installed 1.3.3 in a CentOS 5 environment, and we noticed the
processes would deadlock as soon as they started using TCP communications. The
test program is one that has been running on other clusters for years with no
problems. Furthermore, using local cores doesn't deadlock the process, whereas forcing
inter-node communications (-bynode scheduling) immediately causes the problem.

Symptoms:
- processes don't crash or die; they use 100% CPU in system space (as opposed to user space)
- stracing one of the processes shows it is freewheeling in a polling loop.
- executing with --mca btl_base_verbose 30 shows weird port assignments: either they
are wrong or they should be interpreted as an offset from the default
btl_tcp_port_min_v4 (1024).
- The error "mca_btl_tcp_endpoint_complete_connect] connect() to <IP ADDR> failed: No
route to host (113)" _may_ be seen. We noticed it only showed up if we had vmnet
interfaces up and running on certain nodes. Note that setting

 oob_tcp_listen_mode=listen_thread
 oob_tcp_if_include=eth0
 btl_tcp_if_include=eth0

was one of our first reactions to this, to no avail.

Workaround we found:

While keeping the above-mentioned MCA parameters, we added btl_tcp_port_min_v4=2000 because
of some firewall rules (which we had obviously disabled as part of the troubleshooting
process) and noticed everything seemed to start working correctly from then on.

This seems to work but I can find no logical explanation as the code seems to be clean
in that respect.

Some pasting for people searching frantically for a solution:

[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.113 on port 2052
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3076
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.113 on port 260
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3588
[cluster-srv1:19900] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1540
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2052
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3076
[cluster-srv1:19894] btl: tcp: attempting to connect() to address 10.194.32.117 on port 516
[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3588
[cluster-srv1:19898] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1028
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2564
[cluster-srv1:19896] btl: tcp: attempting to connect() to address 10.194.32.117 on port 4
[cluster-srv3:13665] btl: tcp: attempting to connect() to address 10.194.32.115 on port 1028
[cluster-srv3:13663] btl: tcp: attempting to connect() to address 10.194.32.115 on port 4
[cluster-srv2][[44096,1],9][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.194.32.117 failed: No route to host (113)
[cluster-srv2][[44096,1],13][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.194.32.117 failed: No route to host (113)
[cluster-srv3][[44096,1],20][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.194.32.115 failed: No route to host (113)
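
One possible reading of the odd port numbers in the log above, offered as an assumption and not as something confirmed from the Open MPI source: the verbose message may be printing the port while it is still in network byte order, without converting it back on a little-endian host. Swapping the two bytes of each logged value gives ports starting at 1024, the default btl_tcp_port_min_v4. A minimal Python sketch of that arithmetic (the helper name swap16 is made up for illustration):

```python
import struct

# Port numbers exactly as they appear in the btl_base_verbose output above.
logged_ports = [2052, 3076, 260, 3588, 1540, 516, 1028, 2564, 4]


def swap16(value):
    """Swap the two bytes of a 16-bit integer.

    This is what ntohs() does on a little-endian host. Assumption: the
    nodes in the log are little-endian (e.g. x86) machines.
    """
    return struct.unpack("<H", struct.pack(">H", value))[0]


# Map each logged value to its byte-swapped counterpart.
decoded = {port: swap16(port) for port in logged_ports}

for port, real in sorted(decoded.items(), key=lambda kv: kv[1]):
    print(f"logged {port:>4} -> byte-swapped {real}")
```

Every byte-swapped value falls between 1024 and 1038. If this reading is right, the strange ports are only a display issue in the debug output, and it would not by itself explain the deadlock or why setting btl_tcp_port_min_v4=2000 helps.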

Cheers!

Eric Thibodeau