Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Greg Watson (g.watson_at_[hidden])
Date: 2008-10-19 15:59:05


It seems a little strange to be using mpirun for this, but barring
providing a separate command, or using ompi_info, I think this would
solve our problem.



On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:

> Sorry for delay - had to ponder this one for a while.
> Jeff and I agree that adding something to ompi_info would not be a
> good idea. Ompi_info has no knowledge or understanding of hostfiles,
> and adding that capability to it would be a major distortion of its
> intended use.
> However, we think we can offer an alternative that might better
> solve the problem. Remember, we now treat hostfiles in a very
> different manner than before - see the wiki page for a complete
> description, or "man orte_hosts".
> So the problem is that, to provide you with what you want, we need
> to "dump" the information from whatever default-hostfile was
> provided, and, if no default-hostfile was provided, then the
> information from each hostfile that was provided with an app_context.
> The best way we could think of to do this is to add another mpirun
> cmd line option --dump-hostfiles that would output the line-by-line
> name from the hostfile plus the name we resolved it to. Of course,
> --xml would cause it to be in XML format.
> Would that meet your needs?
> Ralph
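
[Illustration of the kind of listing proposed above: each hostfile entry paired with what it resolves to on the machine doing the dump. This is only a minimal sketch under stated assumptions, not Open MPI's implementation. The function name dump_hostfile is hypothetical, the hostfile parsing is simplified (first token on a line is the host, "#" starts a comment), and it resolves to an IPv4 address via Python's socket.gethostbyname, whereas the proposed option would report the name mpirun itself resolved.]

```python
import socket


def dump_hostfile(lines, resolve=socket.gethostbyname):
    """Pair each hostfile entry with what it resolves to on this machine.

    Hypothetical helper for illustration only. `resolve` is injectable so
    the function can be exercised without touching DNS.
    """
    entries = []
    for line in lines:
        # Simplified parsing: drop trailing comments and surrounding blanks.
        text = line.split("#", 1)[0].strip()
        if not text:
            continue
        # Assume the first token is the host; the rest (e.g. slots=4) is ignored.
        name = text.split()[0]
        try:
            resolved = resolve(name)
        except OSError:
            # Covers socket.gaierror and socket.herror: name not resolvable here.
            resolved = None
        entries.append((name, resolved))
    return entries
```

[Run on the head node, the second element of each pair reflects that node's own /etc/hosts and DNS view, which is the information a remote tool cannot reproduce by itself.]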
> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>> Hi Ralph,
>> We've been discussing this back and forth a bit internally and
>> don't really see an easy solution. Our problem is that Eclipse is
>> not running on the head node, so gethostbyname will not necessarily
>> resolve to the same address. For example, the hostfile might refer
>> to the head node by an internal network address that is not visible
>> to the outside world. Since gethostbyname also looks in /etc/hosts,
>> it may resolve locally but not on a remote system. The only thing I
>> can think of would be, rather than us reading the hostfile directly
>> as we do now, to provide an option to ompi_info that would dump the
>> hostfile using the same rules that you apply when you're using the
>> hostfile. Would that be feasible?
>> Greg
>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>> Sorry for delay - was on vacation and am now trying to work my way
>>> back to the surface.
>>> I'm not sure I can fix this one for two reasons:
>>> 1. In general, OMPI doesn't really care what name is used for the
>>> node. However, the problem is that it needs to be consistent. In
>>> this case, ORTE has already used the name returned by gethostname
>>> to create its session directory structure long before mpirun reads
>>> a hostfile. This is why we retain the value from gethostname
>>> instead of allowing it to be overwritten by the name in whatever
>>> allocation we are given. Using the name in hostfile would require
>>> that I either find some way to remember any prior name, or that I
>>> tear down and rebuild the session directory tree - neither seems
>>> attractive nor simple (e.g., what happens when the user provides
>>> multiple entries in the hostfile for the node, each with a
>>> different IP address based on another interface in that node?
>>> Sounds crazy, but we have already seen it done - which one do I
>>> use?).
>>> 2. We don't actually store the hostfile info anywhere - we just
>>> use it and forget it. For us to add an XML attribute containing
>>> any hostfile-related info would therefore require us to re-read
>>> the hostfile. I could have it do that -only- in the case of "XML
>>> output required", but it seems rather ugly.
>>> An alternative might be for you to simply do a "gethostbyname"
>>> lookup of the IP address or hostname to see if it matches instead
>>> of just doing a strcmp. This is what we have to do internally as
>>> we frequently have problems with FQDN vs. non-FQDN vs. IP
>>> addresses etc. If the local OS hasn't cached the IP address for
>>> the node in question it can take a little time to DNS resolve it,
>>> but otherwise works fine.
>>> I can point you to the code in OPAL that we use - I would think
>>> something similar would be easy to implement in your code and
>>> would readily solve the problem.
>>> Ralph
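
[Illustration of the lookup-based comparison suggested above, as opposed to a plain string compare. This is a minimal sketch and not the OPAL code referred to in the message. The names resolve_addresses and same_host are hypothetical, and the sketch assumes IPv4 resolution through Python's socket.gethostbyname_ex, a wrapper around the same family of resolver calls.]

```python
import socket


def resolve_addresses(name):
    """Return the set of IPv4 addresses `name` resolves to on this machine."""
    try:
        _, _, addresses = socket.gethostbyname_ex(name)
    except OSError:
        # Covers socket.gaierror and socket.herror: treat as unresolvable.
        return set()
    return set(addresses)


def same_host(name_a, name_b, resolve=resolve_addresses):
    """Decide whether two host identifiers refer to the same node.

    Identical strings match immediately. Otherwise both are resolved and
    they match if they share at least one address. `resolve` is injectable
    so the logic can be tested without DNS.
    """
    if name_a == name_b:
        return True
    return bool(resolve(name_a) & resolve(name_b))
```

[The comparison is only as good as the resolver on the machine that runs it. Two machines with different /etc/hosts or DNS views can disagree about the same pair of names, which is the limitation raised later in the thread for tools not running on the head node.]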
>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>> Ralph,
>>>> The problem we're seeing is just with the head node. If I specify
>>>> a particular IP address for the head node in the hostfile, it
>>>> gets changed to the FQDN when displayed in the map. This is a
>>>> problem for us as we need to be able to match the two, and since
>>>> we're not necessarily running on the head node, we can't always
>>>> do the same resolution you're doing.
>>>> Would it be possible to use the same address that is specified in
>>>> the hostfile, or alternatively provide an XML attribute that
>>>> contains this information?
>>>> Thanks,
>>>> Greg
>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>> Not in that regard, depending upon what you mean by "recently".
>>>>> The only changes I am aware of wrt nodes consisted of some
>>>>> changes to the order in which we use the nodes when specified by
>>>>> hostfile or -host, and a little #if protectionism needed by
>>>>> Brian for the Cray port.
>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>> offhand think of anything in the code base that would replace a
>>>>> host name with the FQDN because we don't get that info for
>>>>> remote nodes. The only exception is the head node (where mpirun
>>>>> sits) - in that lone case, we default to the name returned to us
>>>>> by gethostname(). We do that because the head node is frequently
>>>>> accessible on a more global basis than the compute nodes - thus,
>>>>> the FQDN is required to ensure that there is no address
>>>>> confusion on the network.
>>>>> If the user refers to compute nodes in a hostfile or -host (or
>>>>> in an allocation from a resource manager) by non-FQDN, we just
>>>>> assume they know what they are doing and the name will correctly
>>>>> resolve to a unique address.
>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>> Hi,
>>>>>> Has the behavior of the -display-map option changed recently in
>>>>>> the 1.3 branch? We're now seeing the host name as a fully
>>>>>> qualified domain name (FQDN) rather than the entry that
>>>>>> was specified in the hostfile. Is there any particular reason
>>>>>> for this? If so, would it be possible to add the hostfile entry
>>>>>> to the output since we need to be able to match the two?
>>>>>> Thanks,
>>>>>> Greg
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]