We've been discussing this back and forth a bit internally and don't
really see an easy solution. Our problem is that Eclipse is not
running on the head node, so gethostbyname will not necessarily
resolve to the same address. For example, the hostfile might refer to
the head node by an internal network address that is not visible to
the outside world. Since gethostbyname also looks in /etc/hosts, a name
may resolve locally but not on a remote system. The only thing I can think
of would be, rather than us reading the hostfile directly as we do
now, to provide an option to ompi_info that would dump the hostfile
using the same rules that you apply when you're using the hostfile.
Would that be feasible?
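
For reference, here is roughly what the lookup-based comparison suggested
below could look like. This is only a sketch, not the OPAL code: the
function names hosts_match and same_addr are made up for illustration, and
it uses getaddrinfo rather than gethostbyname so that IPv6 entries are
handled as well. It returns -1 when either name fails to resolve, which is
the case we would hit whenever the hostfile entry is only meaningful on
the head node.

```c
#define _POSIX_C_SOURCE 200112L

#include <netdb.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Compare two resolved socket addresses by family and raw address bytes. */
static bool same_addr(const struct sockaddr *a, const struct sockaddr *b)
{
    if (a->sa_family != b->sa_family) {
        return false;
    }
    if (a->sa_family == AF_INET) {
        const struct sockaddr_in *x = (const struct sockaddr_in *)a;
        const struct sockaddr_in *y = (const struct sockaddr_in *)b;
        return memcmp(&x->sin_addr, &y->sin_addr, sizeof x->sin_addr) == 0;
    }
    if (a->sa_family == AF_INET6) {
        const struct sockaddr_in6 *x = (const struct sockaddr_in6 *)a;
        const struct sockaddr_in6 *y = (const struct sockaddr_in6 *)b;
        return memcmp(&x->sin6_addr, &y->sin6_addr, sizeof x->sin6_addr) == 0;
    }
    return false;
}

/*
 * Hypothetical helper (not part of Open MPI).
 * Returns 1 if the two names share at least one address,
 *         0 if both resolve but no address is shared,
 *        -1 if either name cannot be resolved from this machine.
 */
int hosts_match(const char *name_a, const char *name_b)
{
    struct addrinfo hints;
    struct addrinfo *res_a = NULL, *res_b = NULL;
    struct addrinfo *pa, *pb;
    int result = 0;

    /* Identical strings match without any lookup. */
    if (strcmp(name_a, name_b) == 0) {
        return 1;
    }

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* avoid one entry per socket type */

    if (getaddrinfo(name_a, NULL, &hints, &res_a) != 0) {
        return -1;
    }
    if (getaddrinfo(name_b, NULL, &hints, &res_b) != 0) {
        freeaddrinfo(res_a);
        return -1;
    }

    /* A node may have several interfaces, so compare every pair. */
    for (pa = res_a; pa != NULL && !result; pa = pa->ai_next) {
        for (pb = res_b; pb != NULL && !result; pb = pb->ai_next) {
            if (same_addr(pa->ai_addr, pb->ai_addr)) {
                result = 1;
            }
        }
    }

    freeaddrinfo(res_a);
    freeaddrinfo(res_b);
    return result;
}
```

A caller would pass the hostfile entry and the name from the map, e.g.
hosts_match(hostfile_entry, map_name), and treat -1 as "cannot tell from
here" rather than as a mismatch.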
On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
> Sorry for delay - was on vacation and am now trying to work my way
> back to the surface.
> I'm not sure I can fix this one for two reasons:
> 1. In general, OMPI doesn't really care what name is used for the
> node. However, the problem is that it needs to be consistent. In
> this case, ORTE has already used the name returned by gethostname to
> create its session directory structure long before mpirun reads a
> hostfile. This is why we retain the value from gethostname instead
> of allowing it to be overwritten by the name in whatever allocation
> we are given. Using the name in hostfile would require that I either
> find some way to remember any prior name, or that I tear down and
> rebuild the session directory tree - neither seems attractive nor
> simple (e.g., what happens when the user provides multiple entries
> in the hostfile for the node, each with a different IP address based
> on another interface in that node? Sounds crazy, but we have already
> seen it done - which one do I use?).
> 2. We don't actually store the hostfile info anywhere - we just use
> it and forget it. For us to add an XML attribute containing any
> hostfile-related info would therefore require us to re-read the
> hostfile. I could have it do that -only- in the case of "XML output
> required", but it seems rather ugly.
> An alternative might be for you to simply do a "gethostbyname"
> lookup of the IP address or hostname to see if it matches instead of
> just doing a strcmp. This is what we have to do internally as we
> frequently have problems with FQDN vs. non-FQDN vs. IP addresses
> etc. If the local OS hasn't cached the IP address for the node in
> question it can take a little time to DNS resolve it, but otherwise
> works fine.
> I can point you to the code in OPAL that we use - I would think
> something similar would be easy to implement in your code and would
> readily solve the problem.
> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>> The problem we're seeing is just with the head node. If I specify a
>> particular IP address for the head node in the hostfile, it gets
>> changed to the FQDN when displayed in the map. This is a problem
>> for us as we need to be able to match the two, and since we're not
>> necessarily running on the head node, we can't always do the same
>> resolution you're doing.
>> Would it be possible to use the same address that is specified in
>> the hostfile, or alternatively provide an XML attribute that
>> contains this information?
>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>> Not in that regard, depending upon what you mean by "recently".
>>> The only changes I am aware of wrt nodes consisted of some changes
>>> to the order in which we use the nodes when specified by hostfile
>>> or -host, and a little #if protectionism needed by Brian for the
>>> Cray port.
>>> Are you seeing this for every node? Reason I ask: I can't offhand
>>> think of anything in the code base that would replace a host name
>>> with the FQDN because we don't get that info for remote nodes. The
>>> only exception is the head node (where mpirun sits) - in that lone
>>> case, we default to the name returned to us by gethostname(). We
>>> do that because the head node is frequently accessible on a more
>>> global basis than the compute nodes - thus, the FQDN is required
>>> to ensure that there is no address confusion on the network.
>>> If the user refers to compute nodes in a hostfile or -host (or in
>>> an allocation from a resource manager) by non-FQDN, we just assume
>>> they know what they are doing and the name will correctly resolve
>>> to a unique address.
>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>> Has the behavior of the -display-map option changed recently in
>>>> the 1.3 branch? We're now seeing the host name as a fully
>>>> qualified domain name (FQDN) rather than the entry that
>>>> was specified in the hostfile. Is there any particular reason for
>>>> this? If so, would it be possible to add the hostfile entry to
>>>> the output since we need to be able to match the two?
>>>> devel mailing list