Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-17 10:46:47

Sorry for the delay - had to ponder this one for a while.

Jeff and I agree that adding something to ompi_info would not be a
good idea. Ompi_info has no knowledge or understanding of hostfiles,
and adding that capability to it would be a major distortion of its
intended use.

However, we think we can offer an alternative that might better solve
the problem. Remember, we now treat hostfiles in a very different
manner than before - see the wiki page for a complete description, or
"man orte_hosts".

So the problem is that, to provide you with what you want, we need to
"dump" the information from whatever default-hostfile was provided,
and, if no default-hostfile was provided, then the information from
each hostfile that was provided with an app_context.

The best way we could think of to do this is to add another mpirun
command-line option, --dump-hostfiles, that would output, line by line,
each name from the hostfile plus the name we resolved it to. Of course,
--xml would cause it to be in xml format.

Would that meet your needs?


On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:

> Hi Ralph,
> We've been discussing this back and forth a bit internally and don't
> really see an easy solution. Our problem is that Eclipse is not
> running on the head node, so gethostbyname will not necessarily
> resolve to the same address. For example, the hostfile might refer
> to the head node by an internal network address that is not visible
> to the outside world. Since gethostbyname also looks in /etc/hosts, it
> may resolve locally but not on a remote system. The only thing I can
> think of would be, rather than us reading the hostfile directly as
> we do now, to provide an option to ompi_info that would dump the
> hostfile using the same rules that you apply when you're using the
> hostfile. Would that be feasible?
> Greg
> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>> Sorry for delay - was on vacation and am now trying to work my way
>> back to the surface.
>> I'm not sure I can fix this one for two reasons:
>> 1. In general, OMPI doesn't really care what name is used for the
>> node. However, the problem is that it needs to be consistent. In
>> this case, ORTE has already used the name returned by gethostname
>> to create its session directory structure long before mpirun reads
>> a hostfile. This is why we retain the value from gethostname
>> instead of allowing it to be overwritten by the name in whatever
>> allocation we are given. Using the name in hostfile would require
>> that I either find some way to remember any prior name, or that I
>> tear down and rebuild the session directory tree - neither seems
>> attractive nor simple (e.g., what happens when the user provides
>> multiple entries in the hostfile for the node, each with a
>> different IP address based on another interface in that node?
>> Sounds crazy, but we have already seen it done - which one do I
>> use?).
>> 2. We don't actually store the hostfile info anywhere - we just use
>> it and forget it. For us to add an XML attribute containing any
>> hostfile-related info would therefore require us to re-read the
>> hostfile. I could have it do that -only- in the case of "XML output
>> required", but it seems rather ugly.
>> An alternative might be for you to simply do a "gethostbyname"
>> lookup of the IP address or hostname to see if it matches instead
>> of just doing a strcmp. This is what we have to do internally as we
>> frequently have problems with FQDN vs. non-FQDN vs. IP addresses
>> etc. If the local OS hasn't cached the IP address for the node in
>> question it can take a little time to DNS resolve it, but otherwise
>> works fine.
>> I can point you to the code in OPAL that we use - I would think
>> something similar would be easy to implement in your code and would
>> readily solve the problem.
>> Ralph
>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>> Ralph,
>>> The problem we're seeing is just with the head node. If I specify
>>> a particular IP address for the head node in the hostfile, it gets
>>> changed to the FQDN when displayed in the map. This is a problem
>>> for us as we need to be able to match the two, and since we're not
>>> necessarily running on the head node, we can't always do the same
>>> resolution you're doing.
>>> Would it be possible to use the same address that is specified in
>>> the hostfile, or alternatively provide an XML attribute that
>>> contains this information?
>>> Thanks,
>>> Greg
>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>> Not in that regard, depending upon what you mean by "recently".
>>>> The only changes I am aware of wrt nodes consisted of some
>>>> changes to the order in which we use the nodes when specified by
>>>> hostfile or -host, and a little #if protectionism needed by Brian
>>>> for the Cray port.
>>>> Are you seeing this for every node? Reason I ask: I can't offhand
>>>> think of anything in the code base that would replace a host name
>>>> with the FQDN because we don't get that info for remote nodes.
>>>> The only exception is the head node (where mpirun sits) - in that
>>>> lone case, we default to the name returned to us by
>>>> gethostname(). We do that because the head node is frequently
>>>> accessible on a more global basis than the compute nodes - thus,
>>>> the FQDN is required to ensure that there is no address confusion
>>>> on the network.
>>>> If the user refers to compute nodes in a hostfile or -host (or in
>>>> an allocation from a resource manager) by non-FQDN, we just
>>>> assume they know what they are doing and the name will correctly
>>>> resolve to a unique address.
>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>> Hi,
>>>>> Has the behavior of the -display-map option changed recently
>>>>> in the 1.3 branch? We're now seeing
>>>>> the host name as a fully resolved DN rather than the entry that
>>>>> was specified in the hostfile. Is there any particular reason
>>>>> for this? If so, would it be possible to add the hostfile entry
>>>>> to the output since we need to be able to match the two?
>>>>> Thanks,
>>>>> Greg
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
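
A minimal sketch of the comparison Ralph suggests in his Sep 22 message:
resolve both names and compare the resulting addresses instead of doing
a strcmp. This is not the OPAL code he refers to. It is written in
Python for brevity, and the function names resolve_addresses and
same_host are hypothetical, chosen only for this illustration.

```python
import socket


def resolve_addresses(name):
    """Return the set of IP address strings that `name` resolves to.

    `name` may be an IP address, a short hostname, or an FQDN.
    An empty set is returned if the lookup fails.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def same_host(a, b, resolve=resolve_addresses):
    """Guess whether two node names refer to the same host.

    Identical strings match immediately. Otherwise both names are
    resolved and considered a match if they share at least one address.
    `resolve` is injectable so the logic can be exercised without DNS.
    """
    if a == b:
        return True
    addrs_a = resolve(a)
    if not addrs_a:
        return False
    return bool(addrs_a & resolve(b))
```

A call such as same_host("10.0.0.1", "head.example.com") (both names
made up here) is true only if the two resolve to a common address from
the machine doing the lookup. That is the limitation Greg raises: a
client that is not on the head node may resolve the names differently,
or not at all.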