On Sat, May 29, 2010 at 8:19 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> From your other note, it sounds like #3 might be the problem here. Do you have some nodes that are configured with "eth0" pointing to your 10.x network, and other nodes with "eth0" pointing to your 192.x network? I have found that having interfaces that share a name but are on different IP addresses sometimes causes OMPI to misconnect.
> If you randomly got some of those nodes in your allocation, that might explain why your jobs sometimes work and sometimes don't.
That is exactly right. On some nodes eth0 is the 1 Gig interface and on
others it is the 10 Gig one. Is that going to be a problem, and is there
a workaround? 192.168 is always the 10 Gig network and 10.0 is always
the 1 Gig network, but the correspondence with eth0 vs. eth1 is not
consistent. I'd have liked it to be, but I couldn't figure out a way to
guarantee the order of the eth