Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI daemon error
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-05-29 19:13:10


On May 29, 2010, at 11:35 AM, Rahul Nabar wrote:

> On Sat, May 29, 2010 at 8:19 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> From your other note, it sounds like #3 might be the problem here. Do you have some nodes that are configured with "eth0" pointing to your 10.x network, and other nodes with "eth0" pointing to your 192.x network? I have found that having interfaces that share a name but sit on different IP networks sometimes causes OMPI to misconnect.
>>
>> If you randomly got some of those nodes in your allocation, that might explain why your jobs sometimes work and sometimes don't.
>
> That is exactly true. On some nodes eth0 is 1Gig and on others 10Gig
> and vice versa. Is that going to be a problem and is there a
> workaround? I mean 192.168 is always the 10Gig and 10.0 the 1Gig, but
> the correspondence with eth0 vs eth1 is not consistent. I'd have liked
> it to be, but couldn't figure out a way to guarantee the order of the
> eth interfaces.

Just set the MCA parameter oob_tcp_if_include to the 192.168 network and you should be okay. I forget the exact parameter syntax for specifying an IP network instead of an interface name, but you can look it up by running

ompi_info --param oob tcp
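
To see which interface name carries the 192.168 network on a given node, something like the following may help. It is only a minimal sketch, assuming a Linux node with the iproute2 "ip" tool; the helper name iface_for_prefix is made up for illustration and is not part of Open MPI.

```shell
# Hypothetical helper (not part of Open MPI): print the name of every
# interface whose IPv4 address starts with the given dotted prefix.
# Expects `ip -o -4 addr show`-style lines on stdin, where field 2 is the
# interface name and field 4 is the address in addr/prefixlen form.
iface_for_prefix() {
    prefix="$1"
    awk -v p="$prefix" '{
        split($4, a, "/")                  # drop the /prefixlen suffix
        if (index(a[1], p ".") == 1)       # address begins with "<prefix>."
            print $2
    }'
}

# Usage on a node (assumes iproute2 is installed):
#   ip -o -4 addr show | iface_for_prefix 192.168
#
# The parameter itself can then be passed on the mpirun command line.
# The value syntax below is an assumption: whether a network in CIDR form
# is accepted depends on the Open MPI version, so check the output of
# `ompi_info --param oob tcp` first.
#   mpirun --mca oob_tcp_if_include 192.168.0.0/16 ...
```

Since eth0 and eth1 are swapped between your nodes, a single interface name will not be right everywhere, which is why selecting by network is the better fit if your version supports it.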

>
> --
> Rahul