On 5/4/2012 8:26 AM, Rolf vandeVaart wrote:
Another cause of TCP hangs on Linux is a configured virbr0 interface.
The TCP BTL will incorrectly assume that it can use the virbr
interfaces to communicate with other nodes. You need to either
disable the virbr interfaces or exclude them from being used by the
TCP BTL.
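
As a rough illustration of the exclusion route, the sketch below assembles an mpirun command line that sets the btl_tcp_if_exclude MCA parameter. The helper name, the process count, and the program path ./my_mpi_app are hypothetical. Because setting this parameter replaces its default value, the loopback interface is listed explicitly (assumed here to be named lo, as on most Linux systems). For a single virbr0 interface the result should be equivalent to typing `mpirun -np 4 --mca btl_tcp_if_exclude lo,virbr0 ./my_mpi_app`.

```python
import shlex


def mpirun_excluding(interfaces, np, program):
    """Build an mpirun argv that keeps the TCP BTL off the given interfaces."""
    # Overriding btl_tcp_if_exclude drops its default value, so loopback
    # ("lo" is an assumption about the interface name) is added back first.
    excluded = ["lo"] + [name for name in interfaces if name != "lo"]
    return [
        "mpirun", "-np", str(np),
        "--mca", "btl_tcp_if_exclude", ",".join(excluded),
        program,
    ]


if __name__ == "__main__":
    # Hypothetical process count and program name, for illustration only.
    print(shlex.join(mpirun_excluding(["virbr0"], 4, "./my_mpi_app")))
```

The script only prints the command rather than running it, so it can be inspected before use.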
2. If that works, then you can also run with a debug switch to see
what connections are being made by MPI.
You can see the connections being made in the attached log:
[archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address 18.104.22.168 on port 2001
Yes, I missed that. So, can we simplify the problem? Can you run with np=2 and one process on each node?
Also, maybe you can send the ifconfig output from each node. We sometimes see this type of hanging when
a node has two different interfaces on the same subnet.
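
One way to check for the same-subnet situation is to copy each interface's address and netmask out of the ifconfig output and group them by network. The following is only a sketch: the function name and the sample addresses are made up for illustration, and it handles IPv4 entries given as (name, address, netmask) tuples.

```python
import ipaddress
from collections import defaultdict


def shared_subnets(interfaces):
    """Return {subnet: [interface names]} for subnets used by 2+ interfaces."""
    by_net = defaultdict(list)
    for name, addr, netmask in interfaces:
        # ip_interface accepts "address/netmask" and derives the network.
        net = ipaddress.ip_interface(f"{addr}/{netmask}").network
        by_net[str(net)].append(name)
    return {net: names for net, names in by_net.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical values, as they might be copied from one node's ifconfig.
    node = [
        ("eth0", "192.168.1.10", "255.255.255.0"),
        ("eth1", "192.168.1.11", "255.255.255.0"),
        ("virbr0", "192.168.122.1", "255.255.255.0"),
    ]
    for net, names in shared_subnets(node).items():
        print(f"{net}: {', '.join(names)}")
```

Any subnet it reports is a candidate for the hang described above and is worth testing with the interface restricted.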
Assuming there are multiple interfaces, can you experiment with the runtime flags outlined here?
Maybe by restricting to specific interfaces you can figure out which network is the problem.
Terry D. Dontje | Principal
Engineering | +1.781.442.2631
95 Network Drive, Burlington, MA 01803