Thanks for your help - your clue solved my problem!
The ultimate solution was to limit MPI communications to the local,
unrouted subnet. I made this the default behavior for all users of my
cluster by adding the following line to the bottom of my
btl_tcp_if_include = 10.0.0.0/8
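
In case it helps anyone who finds this thread later: as I understand it, the
CIDR value in the line above restricts the TCP transport to interfaces whose
address falls inside that subnet. A rough way to sanity-check which addresses
would qualify is a few lines of Python. This is only a sketch; the helper name
and the sample addresses are made-up placeholders, not the ones from my nodes.

```python
import ipaddress

# The subnet from the btl_tcp_if_include line above.
ALLOWED = ipaddress.ip_network("10.0.0.0/8")


def in_allowed_subnet(addr, network=ALLOWED):
    """Return True if addr (a dotted-quad string) lies inside network."""
    return ipaddress.ip_address(addr) in network


# Hypothetical addresses, for illustration only.
candidates = ["10.0.1.5", "10.200.3.7", "192.168.1.20", "128.100.50.4"]

# Keep only the addresses that fall inside the allowed subnet.
matching = [a for a in candidates if in_allowed_subnet(a)]
print(matching)
```

Running the same check against the addresses that ifconfig reports on each node
should show quickly whether every node has an interface inside the subnet.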
Thanks again - what a relief!
On Fri, Jul 5, 2013, at 01:25 AM, Gustavo Correa wrote:
> Hi Jed
> You could try to select only the Ethernet interface that matches your node's IP,
> which seems to be en2.
> The en1 interface seems to be an external IP.
> Not sure about en3, but it is odd that it has a
> different IP than en2 while being in the same subnet.
> I wonder if this may be the reason for the program hanging.
> You may need to check the ifconfig output on all nodes for a consistent set of
> interfaces/IP addresses,
> and tailor your mpiexec command line and your hostfile accordingly.
> Say, something like this:
> mpiexec -mca btl_tcp_if_include en2 -hostfile your_hostfile -np 43
> See this FAQ (actually, all of them are very informative):
> I hope this helps,
> Gus Correa