Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-07-03 15:09:46


Ralph and I talked some more about this.

Here's what we think:

1. The root cause of the issue is that you are assigning a non-existent IP address to a name. I.e., <foo> maps to 127.0.1.1, but that IP address does not exist on any interface of the machine. Hence, OMPI will never conclude that <foo> is "local". If you had assigned <foo> to the 127.0.0.1 address instead, things should have worked fine.
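To illustrate with values from this thread (a Python sketch with made-up names, not OMPI's actual code): the node's interfaces carry 127.0.0.1 and 10.1.255.194, so a membership test against the locally-configured addresses can never match 127.0.1.1:

```python
# Hypothetical sketch: decide whether a resolved address belongs to this
# machine by checking it against the addresses actually configured on
# local interfaces (values taken from this thread's /etc/hosts output).
local_addrs = {"127.0.0.1", "10.1.255.194"}  # lo and eth0, hypothetically

def address_is_local(addr):
    """True only if addr is configured on some local interface."""
    return addr in local_addrs

print(address_is_local("127.0.1.1"))   # False: 127.0.1.1 exists nowhere
print(address_is_local("127.0.0.1"))   # True: the real loopback address
```

Since the name resolves to 127.0.1.1 first, the check fails and OMPI treats the host as remote.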

Just curious: why are you doing this?

2. That being said, OMPI is not currently looking at all the responses from gethostbyname() -- we're only looking at the first one. In the spirit of how clients are supposed to behave when multiple IP addresses are returned from a single name lookup, OMPI should examine all of those addresses and see if it finds one that it "likes", and then use that. So we should extend OMPI to examine all the IP addresses from gethostbyname(). This should also fix your issue.
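The proposed fix, sketched in Python (illustrative only -- the real change will be in OMPI's C code, and the function and variable names here are made up): scan the whole address list from the lookup instead of trusting only the first entry, and accept the name as local if any returned address is configured locally:

```python
def find_local_address(resolved_addrs, local_addrs):
    """Return the first resolved address that is configured on a local
    interface, or None if no resolved address is local."""
    return next((a for a in resolved_addrs if a in local_addrs), None)

# Addresses from the thread's /etc/hosts: the current behavior looks only
# at resolved_addrs[0] (127.0.1.1) and concludes "remote"; scanning the
# full list finds 10.1.255.194, which the node actually has.
resolved = ["127.0.1.1", "10.1.255.194"]
local = {"127.0.0.1", "10.1.255.194"}
print(find_local_address(resolved, local))      # 10.1.255.194
print(find_local_address(resolved[:1], local))  # None (current behavior)
```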

Ralph is going to work on this, but it'll likely take him a little time to get it done. We'll get it into the trunk and probably ask you to verify that it works for you. And if so, we'll back-port to the v1.6 and v1.7 series.

One final caveat, however: at this point, it does not look likely that 1.6.6 will ever happen. If this all works out, the fix will be committed to the v1.6 tree, and you can grab a nightly tarball snapshot (which are identical to our release tarballs except for their version numbers), or you can patch your 1.6.5 installation. But if 1.6.6 is ever released, the fix will be included.

On Jul 2, 2013, at 9:53 AM, Riccardo Murri <riccardo.murri_at_[hidden]> wrote:

> Hi,
>
> sorry for the delay in replying -- pretty busy week :-(
>
>
> On 28 June 2013 21:54, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>> Here's what we think we know (I'm using the name "foo" instead of
>> your actual hostname because it's easier to type):
>>
>> 1. When you run "hostname", you get foo.local back
>
> Yes.
>
>
>> 2. In your /etc/hosts file, foo.local is listed on two lines:
>> 127.0.1.1
>> 10.1.255.201
>>
>
> Yes:
>
> [rmurri_at_nh64-5-9 ~]$ fgrep nh64-5-9 /etc/hosts
> 127.0.1.1 nh64-5-9.local nh64-5-9
> 10.1.255.194 nh64-5-9.local nh64-5-9
>
>
>> 3. When you login to the "foo" server and execute mpirun with a hostfile
>> that contains "foo", Open MPI incorrectly thinks that the local machine is
>> not foo, and therefore tries to ssh to it (and things go downhill from
>> there).
>>
>
> Yes.
>
>
>> 4. When you login to the "foo" server and execute mpirun with a hostfile
>> that contains "foo.local" (you said "FQDN", but never said exactly what you
>> meant by that -- I'm assuming "foo.local", not "foo.yourdomain.com"), then
>> Open MPI behaves properly.
>>
>
> Yes.
>
> FQDN = foo.local. (This is a compute node in a cluster that has no
> public IP address or DNS entry -- it only has an interface
> to the cluster-private network. I presume this is not relevant to
> OpenMPI as long as all names are correctly resolved via `/etc/hosts`.)
>
>
>> Is that all correct?
>
> Yes, all correct.
>
>
>> We have some followup questions for you:
>>
>> 1. What happens when you try to resolve "foo"? (e.g., via the "dig" program
>> -- "dig foo")
>
> Here's what happens with `dig`:
>
> [rmurri_at_nh64-5-9 ~]$ dig nh64-5-9
>
> ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4373
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;nh64-5-9. IN A
>
> ;; AUTHORITY SECTION:
> . 3600 IN SOA a.root-servers.net. nstld.verisign-grs.com.
> 2013070200 1800 900 604800 86400
>
> ;; Query time: 17 msec
> ;; SERVER: 10.1.1.1#53(10.1.1.1)
> ;; WHEN: Tue Jul 2 15:47:57 2013
> ;; MSG SIZE rcvd: 101
>
> However, `getent hosts` has a different reply:
>
> [rmurri_at_nh64-5-9 ~]$ getent hosts nh64-5-9
> 127.0.1.1 nh64-5-9.local nh64-5-9
>
>
>> 2. What happens when you try to resolve "foo.local"? (e.g., "dig foo.local")
>
> Here's what happens with `dig`:
>
> [rmurri_at_nh64-5-9 ~]$ dig nh64-5-9.local
>
> ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.local
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62092
> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
>
> ;; QUESTION SECTION:
> ;nh64-5-9.local. IN A
>
> ;; ANSWER SECTION:
> nh64-5-9.local. 259200 IN A 10.1.255.194
>
> ;; AUTHORITY SECTION:
> local. 259200 IN NS ns.local.
>
> ;; ADDITIONAL SECTION:
> ns.local. 259200 IN A 127.0.0.1
>
> ;; Query time: 0 msec
> ;; SERVER: 10.1.1.1#53(10.1.1.1)
> ;; WHEN: Tue Jul 2 15:48:50 2013
> ;; MSG SIZE rcvd: 81
>
> Same query resolved via `getent hosts`:
>
> [rmurri_at_nh64-5-9 ~]$ getent hosts nh64-5-9
> 127.0.1.1 nh64-5-9.local nh64-5-9
>
>
>> 3. What happens when you try to resolve "foo.yourdomain.com"? (e.g., "dig
>> foo.yourdomain.com")
>
> This yields an empty response from both `dig` and `getent hosts` as the node
> is only attached to a private network and not registered in DNS:
>
> [rmurri_at_nh64-5-9 ~]$ getent hosts nh64-5-9.uzh.ch
> [rmurri_at_nh64-5-9 ~]$ dig nh64-5-9.uzh.ch
>
> ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.uzh.ch
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61801
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;nh64-5-9.uzh.ch. IN A
>
> ;; AUTHORITY SECTION:
> uzh.ch. 8921 IN SOA ns1.uzh.ch. hostmaster.uzh.ch. 384627811
> 3600 1800 3600000 10800
>
> ;; Query time: 0 msec
> ;; SERVER: 10.1.1.1#53(10.1.1.1)
> ;; WHEN: Tue Jul 2 15:50:54 2013
> ;; MSG SIZE rcvd: 84
>
>
>> 4. Please apply the attached patch to your Open MPI 1.6.5 build (please note
>> that it adds diagnostic output; do *not* put this patch into production)
>> and:
>> 4a. Run with one of your "bad" cases and send us the output
>> 4b. Run with one of your "good" cases and send us the output
>
> Please find the outputs attached. The exact `mpiexec` invocation and
> the machines file are at the beginning of each file.
>
> Note that I allocated 8 slots (on 4 nodes), but only use 2 slots (on 1 node).
>
> Thanks,
> Riccardo
> <exam01.out.BAD><exam01.out.GOOD>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/