Joe Landman wrote:
> 3) using btl to turn off sm and openib, generates lots of these messages:
> connect() failed with errno=113
> No route to host at -e line 1.
> This is wrong, all the nodes are visible from all the other nodes on a
> private subnet. For example:
OK, fixed this. It turns out we have IPoIB running, and one adapter needed
to be brought down and back up. The TCP version now appears to be
running, though I still get strange hangs after a random (never the
same) number of iterations.
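For anyone decoding the quoted error: on Linux with glibc, errno 113 is normally EHOSTUNREACH, which is where the "No route to host" text comes from. A minimal sketch for checking this on a given node follows. The helper name format_connect_error is made up for illustration and is not part of Open MPI.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI function): render an errno value
 * from a failed connect() as a readable log line. Returns the number of
 * characters snprintf would have written. */
static int format_connect_error(int err, char *buf, size_t len)
{
    return snprintf(buf, len, "connect() failed with errno=%d (%s)",
                    err, strerror(err));
}
```

Calling format_connect_error(EHOSTUNREACH, buf, sizeof buf) and printing buf shows the numeric value and message the local C library uses. The number can differ on other architectures and operating systems.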
The hangs are random: they don't happen at the same time step, but they do
happen at a similar place in the code. That suggests to me that something
may be amiss in the MPI_Waitsome function. Possibly a completion was
posted and, due to buffer sizes, fell off the scoreboard.
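One way to test that suspicion is to log every completion that MPI_Waitsome reports and compare the running total against the number of requests posted. The sketch below is not the application code in question. It assumes a simple all-to-all exchange of one int per peer, and the name drain_requests is invented here. Only standard MPI calls are used.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Wait for all n requests, logging each index MPI_Waitsome reports.
 * Returns the number of completions actually seen. */
static int drain_requests(int n, MPI_Request *reqs, int rank)
{
    int *indices = malloc((size_t)n * sizeof *indices);
    MPI_Status *statuses = malloc((size_t)n * sizeof *statuses);
    int done = 0;

    while (done < n) {
        int outcount;
        MPI_Waitsome(n, reqs, &outcount, indices, statuses);
        if (outcount == MPI_UNDEFINED)   /* no active requests remain */
            break;
        for (int i = 0; i < outcount; i++)
            fprintf(stderr, "rank %d: request %d completed\n",
                    rank, indices[i]);
        done += outcount;
    }
    free(indices);
    free(statuses);
    return done;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One receive and one send per peer (assumed pattern). */
    int nreq = 2 * (size - 1), k = 0, sendval = rank;
    int *recvbuf = malloc((size_t)size * sizeof *recvbuf);
    MPI_Request *reqs = malloc((size_t)(nreq > 0 ? nreq : 1) * sizeof *reqs);

    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        MPI_Irecv(&recvbuf[peer], 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[k++]);
        MPI_Isend(&sendval, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[k++]);
    }

    int done = drain_requests(nreq, reqs, rank);
    if (done != nreq)
        fprintf(stderr, "rank %d: saw %d of %d completions\n", rank, done, nreq);

    free(reqs);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

If a rank hangs inside MPI_Waitsome, the last lines it wrote to stderr show which request indices completed, and so which ones are still outstanding.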
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615