Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
From: Riccardo Murri (riccardo.murri_at_[hidden])
Date: 2013-07-02 09:53:05


Hi,

sorry for the delay in replying -- pretty busy week :-(

On 28 June 2013 21:54, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> Here's what we think we know (I'm using the name "foo" instead of
> your actual hostname because it's easier to type):
>
> 1. When you run "hostname", you get foo.local back

Yes.

> 2. In your /etc/hosts file, foo.local is listed on two lines:
> 127.0.1.1
> 10.1.255.201
>

Yes:

    [rmurri_at_nh64-5-9 ~]$ fgrep nh64-5-9 /etc/hosts
    127.0.1.1 nh64-5-9.local nh64-5-9
    10.1.255.194 nh64-5-9.local nh64-5-9
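For reference, the two addresses belong to different classes: 127.0.1.1 lies inside the loopback block 127.0.0.0/8 (although it is not the usual 127.0.0.1), while 10.1.255.194 is an address on the cluster-private network. A minimal sketch of that classification using Python's standard `ipaddress` module; the helper name `classify` is made up for illustration and says nothing about how Open MPI itself decides what is local:

```python
import ipaddress

def classify(addr):
    """Label an IP address as loopback, private, or other (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    # Check loopback first: Python also reports 127.0.0.0/8 as private.
    if ip.is_loopback:
        return "loopback"
    if ip.is_private:
        return "private"
    return "other"

for addr in ("127.0.1.1", "10.1.255.194"):
    print(addr, classify(addr))
```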

> 3. When you login to the "foo" server and execute mpirun with a hostfile
> that contains "foo", Open MPI incorrectly thinks that the local machine is
> not foo, and therefore tries to ssh to it (and things go downhill from
> there).
>

Yes.

> 4. When you login to the "foo" server and execute mpirun with a hostfile
> that contains "foo.local" (you said "FQDN", but never said exactly what you
> meant by that -- I'm assuming "foo.local", not "foo.yourdomain.com"), then
> Open MPI behaves properly.
>

Yes.

FQDN = foo.local. (This is a compute node in a cluster that has
neither a public IP address nor a DNS entry -- it only has an interface
to the cluster-private network. I presume this is not relevant to
Open MPI as long as all names are correctly resolved via `/etc/hosts`.)

> Is that all correct?

Yes, all correct.

> We have some followup questions for you:
>
> 1. What happens when you try to resolve "foo"? (e.g., via the "dig" program
> -- "dig foo")

Here's what happens with `dig`:

    [rmurri_at_nh64-5-9 ~]$ dig nh64-5-9

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4373
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;nh64-5-9. IN A

    ;; AUTHORITY SECTION:
    . 3600 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2013070200 1800 900 604800 86400

    ;; Query time: 17 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul 2 15:47:57 2013
    ;; MSG SIZE rcvd: 101

However, `getent hosts` has a different reply:

    [rmurri_at_nh64-5-9 ~]$ getent hosts nh64-5-9
    127.0.1.1 nh64-5-9.local nh64-5-9
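The difference is presumably because `dig` queries the DNS server directly, whereas `getent hosts` goes through the system resolver, which on this node consults `/etc/hosts`. A minimal sketch of a first-match lookup over the two `/etc/hosts` lines shown earlier, assuming that the first line listing the name wins (the usual behaviour for a hosts file, but an assumption here); the function `first_hosts_match` is hypothetical and only mimics the lookup, it is not what glibc or Open MPI actually run:

```python
# The two /etc/hosts lines for this node, as shown by fgrep above.
HOSTS_TEXT = """\
127.0.1.1 nh64-5-9.local nh64-5-9
10.1.255.194 nh64-5-9.local nh64-5-9
"""

def first_hosts_match(hosts_text, name):
    """Return (address, names) from the first hosts line listing `name`, else None."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            return address, names
    return None

print(first_hosts_match(HOSTS_TEXT, "nh64-5-9"))
```

Under that assumption, both the short name and `nh64-5-9.local` map to the loopback entry, because the 127.0.1.1 line precedes the 10.1.255.194 line in the file.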

> 2. What happens when you try to resolve "foo.local"? (e.g., "dig foo.local")

Here's what happens with `dig`:

    [rmurri_at_nh64-5-9 ~]$ dig nh64-5-9.local

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.local
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62092
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

    ;; QUESTION SECTION:
    ;nh64-5-9.local. IN A

    ;; ANSWER SECTION:
    nh64-5-9.local. 259200 IN A 10.1.255.194

    ;; AUTHORITY SECTION:
    local. 259200 IN NS ns.local.

    ;; ADDITIONAL SECTION:
    ns.local. 259200 IN A 127.0.0.1

    ;; Query time: 0 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul 2 15:48:50 2013
    ;; MSG SIZE rcvd: 81

Same query resolved via `getent hosts`:

    [rmurri_at_nh64-5-9 ~]$ getent hosts nh64-5-9
    127.0.1.1 nh64-5-9.local nh64-5-9

> 3. What happens when you try to resolve "foo.yourdomain.com"? (e.g., "dig
> foo.yourdomain.com")

This yields no address from either `dig` (NXDOMAIN) or `getent hosts`
(empty output), as the node is only attached to a private network and
is not registered in DNS:

    [rmurri_at_nh64-5-9 ~]$ getent hosts nh64-5-9.uzh.ch
    [rmurri_at_nh64-5-9 ~]$ dig nh64-5-9.uzh.ch

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.uzh.ch
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61801
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;nh64-5-9.uzh.ch. IN A

    ;; AUTHORITY SECTION:
    uzh.ch. 8921 IN SOA ns1.uzh.ch. hostmaster.uzh.ch. 384627811 3600 1800 3600000 10800

    ;; Query time: 0 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul 2 15:50:54 2013
    ;; MSG SIZE rcvd: 84

> 4. Please apply the attached patch to your Open MPI 1.6.5 build (please note
> that it adds diagnostic output; do *not* put this patch into production)
> and:
> 4a. Run with one of your "bad" cases and send us the output
> 4b. Run with one of your "good" cases and send us the output

Please find the outputs attached. The exact `mpiexec` invocation and
the machines file are at the beginning of each file.

Note that I allocated 8 slots (on 4 nodes), but only use 2 slots (on 1 node).

Thanks,
Riccardo