Ralph and I talked about this issue this afternoon. We're still struggling to understand the details of your configuration, in part because this thread was hijacked twice with issues unrelated to this 127.0.1.1 issue. Here's what we think we know (I'm using the name "foo" instead of your actual hostname because it's easier to type):
1. When you run "hostname", you get foo.local back
2. In your /etc/hosts file, foo.local is listed on two lines:
3. When you log in to the "foo" server and execute mpirun with a hostfile that contains "foo", Open MPI incorrectly thinks that the local machine is not foo, and therefore tries to ssh to it (and things go downhill from there).
4. When you log in to the "foo" server and execute mpirun with a hostfile that contains "foo.local" (you said "FQDN", but never said exactly what you meant by that -- I'm assuming "foo.local", not "foo.yourdomain.com"), then Open MPI behaves properly.
Is that all correct?
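If it helps to confirm point 2, here is a minimal sketch that lists every address whose hosts-file line mentions a given name. The function name addresses_for and the SAMPLE text are made up for illustration; the sample is not your real /etc/hosts, and 192.0.2.10 is a documentation-only placeholder address.

```python
def addresses_for(name, hosts_text):
    """Return every address whose hosts-file line lists `name`."""
    found = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # First field is the address; the rest are names and aliases.
        address, *names = line.split()
        if name in names:
            found.append(address)
    return found


# Hypothetical sample, NOT the real file from this thread.
SAMPLE = """\
127.0.0.1   localhost
127.0.1.1   foo.local
192.0.2.10  foo.local   # placeholder address
"""

print(addresses_for("foo.local", SAMPLE))  # ['127.0.1.1', '192.0.2.10']
```

To check the real file, pass open("/etc/hosts").read() as the second argument, with your actual hostname as the first.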
We have some follow-up questions for you:
1. What happens when you try to resolve "foo"? (e.g., via the "dig" program -- "dig foo")
2. What happens when you try to resolve "foo.local"? (e.g., "dig foo.local")
3. What happens when you try to resolve "foo.yourdomain.com"? (e.g., "dig foo.yourdomain.com")
4. Please apply the attached patch to your Open MPI 1.6.5 build (please note that it adds diagnostic output; do *not* put this patch into production) and:
4a. Run with one of your "bad" cases and send us the output
4b. Run with one of your "good" cases and send us the output
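One caveat on questions 1-3: "dig" queries DNS servers directly and does not read /etc/hosts, so it may not show what a program on the node actually sees. As a complement, here is a minimal Python sketch that goes through the system resolver (which normally does consult /etc/hosts) and flags loopback answers. The helper names resolve and is_loopback are ours, not part of Open MPI.

```python
import ipaddress
import socket


def resolve(name):
    """Return the sorted, de-duplicated addresses the system resolver gives for `name`."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # The name does not resolve at all.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def is_loopback(address):
    """True for 127.0.0.0/8 and ::1."""
    # Strip an IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%", 1)[0]).is_loopback


if __name__ == "__main__":
    name = socket.gethostname()
    for address in resolve(name):
        print(name, address, "loopback" if is_loopback(address) else "non-loopback")
```

Running the same loop for "foo", "foo.local" and "foo.yourdomain.com" (with your real names substituted) would show which of them come back as a 127.x address.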
On Jun 26, 2013, at 7:38 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> The root cause of the problem is that you are assigning your host name to the loopback device. This is rather unusual, but not forbidden. Normally, people would name that interface something like "localhost" since it cannot be used to communicate off-node.
> Doing it the way you have could cause problems for you as programs that do a lookup to communicate will get the loopback address when they might have expected something else. Still, we should handle this case.
> I'll see what we can do.
> On Wed, Jun 26, 2013 at 2:26 AM, Riccardo Murri <riccardo.murri_at_[hidden]> wrote:
> On 26 June 2013 03:11, Ralph Castain <rhc_at_[hidden]> wrote:
> > I've been reviewing the code, and I think I'm getting a handle on
> > the issue.
> > Just to be clear - your hostname resolves to the 127 address? And you are on
> > a Linux (not one of the BSD flavors out there)?
> Yes (but resolves to 127.0.1.1 -- not the usual 127.0.0.1), and yes
> (Rocks 5.3 ~= CentOS 5.3).
> > If the answer to both is "yes", then the problem is that we ignore loopback
> > devices if anything else is present. When we check to see if the hostname we
> > were given is the local node, we resolve the name to the address and then
> > check our list of interfaces. The loopback device is ignored and therefore
> > not on the list. So if you resolve to the 127 address, we will decide this
> > is a different node than the one we are on.
> > I can modify that logic, but want to ensure this accurately captures the
> > problem. I'll also have to discuss the change with the other developers to
> > ensure we don't shoot ourselves in the foot if we make it.
> Ok, thanks -- I'll keep an eye on your replies.
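To make the behavior Ralph describes in the quoted text concrete, here is a rough Python sketch of that check. It is not the Open MPI code, which is written in C and is not shown in this thread; the function name is_local, the accept_loopback switch, and the interface list are all hypothetical. The accept_loopback branch is only one possible relaxation, not the change that will actually be made.

```python
import ipaddress


def is_local(resolved, interfaces, accept_loopback=False):
    """Decide whether `resolved` belongs to this node, given its interface addresses."""
    target = ipaddress.ip_address(resolved)
    if accept_loopback and target.is_loopback:
        # Hypothetical relaxed rule: any loopback address means "this node".
        return True
    addrs = [ipaddress.ip_address(a) for a in interfaces]
    # Loopback devices are ignored if anything else is present.
    non_loopback = [a for a in addrs if not a.is_loopback]
    considered = non_loopback if non_loopback else addrs
    return target in considered


# Hypothetical node: loopback plus one placeholder (documentation-only) address.
INTERFACES = ["127.0.0.1", "192.0.2.10"]

print(is_local("127.0.1.1", INTERFACES))                        # False
print(is_local("192.0.2.10", INTERFACES))                       # True
print(is_local("127.0.1.1", INTERFACES, accept_loopback=True))  # True
```

The first call mirrors the failure in this thread: the hostname resolves to 127.0.1.1, nothing on the filtered interface list matches, and the node is treated as remote.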