Hi Jeff, Ralph,
first of all: thanks for your work on this!
On 3 July 2013 21:09, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> 1. The root cause of the issue is that you are assigning a
> non-existent IP address to a name. I.e., <foo> maps to 127.0.1.1,
> but that IP address does not exist anywhere. Hence, OMPI will never
> conclude that that <foo> is "local". If you had assigned <foo> to
> the 127.0.0.1 address, things should have worked fine.
OK, I see. Would it also have worked if I had added the 127.0.1.1
address to the "lo" interface (in addition to 127.0.0.1)?
> Just curious: why are you doing this?
It's commonplace in Ubuntu/Debian installations; see, e.g.,
In our case, it was rolled out as a fix for some cron job running on
Apache servers (apparently Debian's Apache looks up 127.0.1.1 and uses
that as the ServerName, unless a server name is explicitly
configured), which was later extended to all hosts because "what harm
can it do?".
(Needless to say, we have rolled back the change.)
> 2. That being said, OMPI is not currently looking at all the
> responses from gethostbyname() -- we're only looking at the first
> one. In the spirit of how clients are supposed to behave when
> multiple IP addresses are returned from a single name lookup, OMPI
> should examine all of those addresses and see if it finds one that
> it "likes", and then use that. So we should extend OMPI to examine
> all the IP addresses from gethostbyname().
Just out of curiosity: would it have worked, had I compiled OMPI with
IPv6 support? (As far as I understand IPv6, an application is
required to examine all the addresses returned for a host name, and
not just pick the first one.)
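For illustration, here is a minimal sketch in C of what "examine all
the addresses" could look like. This is not OMPI code: the helper names
is_local_ipv4 and name_is_local are hypothetical, it uses getaddrinfo()
instead of gethostbyname(), and it handles IPv4 only. A name is treated
as "local" if any address it resolves to is assigned to some interface
reported by getifaddrs().

```c
#define _GNU_SOURCE /* assumption: glibc; exposes getaddrinfo under strict -std=c99 */
#include <ifaddrs.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: return 1 if the IPv4 address `addr` is assigned
 * to any local interface, 0 otherwise (including on error). */
static int is_local_ipv4(struct in_addr addr)
{
    struct ifaddrs *ifs, *i;
    int found = 0;

    if (getifaddrs(&ifs) != 0)
        return 0;
    for (i = ifs; i != NULL; i = i->ifa_next) {
        /* Skip interfaces without an address and non-IPv4 entries. */
        if (i->ifa_addr == NULL || i->ifa_addr->sa_family != AF_INET)
            continue;
        if (((struct sockaddr_in *)i->ifa_addr)->sin_addr.s_addr == addr.s_addr) {
            found = 1;
            break;
        }
    }
    freeifaddrs(ifs);
    return found;
}

/* Hypothetical helper: return 1 if ANY IPv4 address that `name`
 * resolves to is local, instead of looking only at the first one. */
static int name_is_local(const char *name)
{
    struct addrinfo hints, *res, *r;
    int found = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only in this sketch */
    hints.ai_socktype = SOCK_STREAM; /* avoid one result per socket type */
    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return 0;
    /* Walk the whole result list rather than stopping at res itself. */
    for (r = res; r != NULL; r = r->ai_next) {
        struct in_addr a = ((struct sockaddr_in *)r->ai_addr)->sin_addr;
        if (is_local_ipv4(a)) {
            found = 1;
            break;
        }
    }
    freeaddrinfo(res);
    return found;
}
```

With a check like this, name_is_local("foo") would return 0 if <foo>
maps only to an unassigned 127.0.1.1, and 1 as soon as any one of the
addresses returned for <foo> is configured on an interface.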
> Ralph is going to work on this, but it'll likely take him a little
> time to get it done. We'll get it into the trunk and probably ask
> you to verify that it works for you. And if so, we'll back-port to
> the v1.6 and v1.7 series.
I'm glad to help and verify, but I guess we do not need the backport
or an urgent fix. The easy workaround for us was to remove the
127.0.1.1 line from the compute nodes (we keep it only on Apache
servers where it originated).