On 5/4/2012 8:26 AM, Rolf vandeVaart wrote:

2. If that works, then you can also run with a debug switch to see
what connections are being made by MPI.
You can see the connections being made in the attached log:

[archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address
138.23.141.162 on port 2001
Yes, I missed that. So, can we simplify the problem? Can you run with np=2 and one process on each node?
Also, maybe you can send the ifconfig output from each node. We sometimes see this type of hang when
a node has two different interfaces on the same subnet.
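
As a rough way to spot that situation without reading ifconfig output by eye, the sketch below groups a node's interfaces by the network they belong to. Only the 138.23.141.162 address comes from the log above; the interface names, the /24 prefix, and the second address are made-up values for illustration, and you would fill the dictionary in from each node's actual ifconfig output.

```python
import ipaddress
from collections import defaultdict


def interfaces_sharing_subnet(ifaces):
    """Return {network: [interface names]} for networks used by 2+ interfaces.

    ifaces maps an interface name to an "address/prefix" string,
    e.g. {"eth0": "138.23.141.162/24"}.
    """
    by_net = defaultdict(list)
    for name, cidr in ifaces.items():
        # ip_interface() accepts a host address with a prefix length and
        # exposes the enclosing network via .network
        net = ipaddress.ip_interface(cidr).network
        by_net[str(net)].append(name)
    return {net: sorted(names) for net, names in by_net.items() if len(names) > 1}


# Hypothetical node: eth1 and the prefix lengths are assumptions.
example = {
    "eth0": "138.23.141.162/24",
    "eth1": "138.23.141.170/24",
    "lo": "127.0.0.1/8",
}
print(interfaces_sharing_subnet(example))
```

An empty result means no two listed interfaces share a subnet; any entry that does appear is a candidate for the hang described above.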

Assuming there are multiple interfaces, can you experiment with the runtime flags outlined here?
http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Maybe by restricting to specific interfaces you can figure out which network is the problem.

Another cause of TCP hangs on Linux is having virbr0 interfaces configured. The tcp btl will incorrectly assume that it can use the virbr interfaces to communicate with other nodes. You need to either disable the virbr interfaces or exclude them from use by the tcp btl.
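
To make the exclusion concrete, here is a small sketch that assembles an mpirun command line for the two-process test with the virbr interface excluded. The btl_tcp_if_exclude MCA parameter is the one described in the FAQ linked above; the second host name (nodeB), the program name (./a.out), and the interface name virbr0 are assumptions, so substitute your own. Note that setting btl_tcp_if_exclude replaces the default exclude list, so the loopback interface is listed explicitly as well.

```python
def mpirun_command(program, hosts, exclude=("lo", "virbr0")):
    """Build an mpirun argument list: one process per listed host,
    with the given interfaces excluded from the tcp btl."""
    return [
        "mpirun",
        "-np", str(len(hosts)),
        "-host", ",".join(hosts),
        # Overrides the default exclude list, so loopback must be repeated here.
        "--mca", "btl_tcp_if_exclude", ",".join(exclude),
        program,
    ]


# "nodeB" and "./a.out" are placeholders, not names from this thread.
cmd = mpirun_command("./a.out", ["archimedes", "nodeB"])
print(" ".join(cmd))
```

If the hang disappears with the virbr interface excluded, that interface was most likely the one the tcp btl was trying to use.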

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com