On May 29, 2010, at 11:35 AM, Rahul Nabar wrote:
> On Sat, May 29, 2010 at 8:19 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>> From your other note, it sounds like #3 might be the problem here. Do you have some nodes that are configured with "eth0" pointing to your 10.x network, and other nodes with "eth0" pointing to your 192.x network? I have found that having interfaces that share a name but are on different IP networks sometimes causes OMPI to misconnect.
>> If you randomly got some of those nodes in your allocation, that might explain why your jobs sometimes work and sometimes don't.
> That is exactly true. On some nodes eth0 is 1Gig and on others 10Gig
> and vice versa. Is that going to be a problem and is there a
> workaround? I mean 192.168 is always the 10Gig and 10.0 the 1 Gig but
> the correspondence with eth0 vs eth1 is not consistent. I'd have liked
> that but couldn't figure out a way to guarantee the order of the eth
Just set the MCA param oob_tcp_if_include to 192.168 and you should be okay. I forget the exact param syntax for specifying an IP network instead of an interface name, but you can get it by running:
ompi_info --param oob tcp
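
A rough sketch of how this could look in practice, not a tested recipe. The helper name ifaces_on_prefix, the application name, the process count, and the CIDR form 192.168.0.0/16 are all assumptions for illustration; whether your Open MPI version accepts a network in that form is exactly what the ompi_info output above should tell you. The helper only reports which interface name sits on the 192.168 network on a given node, which is handy for checking how inconsistent the eth0/eth1 mapping really is.

```shell
#!/bin/sh
# List interfaces whose IPv4 address begins with a given prefix.
# Input: lines in the format printed by `ip -o -4 addr show`
# (field 2 = interface name, field 4 = address/prefixlen).
ifaces_on_prefix() {
    awk -v p="$1" '{ split($4, a, "/"); if (index(a[1], p ".") == 1) print $2 }'
}

# On a real node (Linux with iproute2 assumed):
#   ip -o -4 addr show | ifaces_on_prefix 192.168

# Assumed CIDR form of the 10Gig network; confirm the accepted syntax
# with `ompi_info --param oob tcp` before relying on it.
SUBNET="192.168.0.0/16"
APP="./my_mpi_app"   # hypothetical application name

# Printed, not executed, so the sketch runs without Open MPI installed.
echo "mpirun --mca oob_tcp_if_include $SUBNET -np 4 $APP"
```

Selecting by network rather than by interface name is the point here: the same setting then applies on every node regardless of whether the 10Gig link came up as eth0 or eth1.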