Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-20 16:49:19

Hmmm...just to be sure we are all clear on this. The reason we
proposed to use mpirun is that "hostfile" has no meaning outside of
mpirun. That's why ompi_info can't do anything in this regard.

We have no idea what hostfile the user may specify until we actually
get the mpirun cmd line. They may have specified a default-hostfile,
but they could also specify hostfiles for the individual app_contexts.
These may or may not include the node upon which mpirun is executing.

So the only way to give you a separate command for the
hostfile<->nodename mapping would be for you to pass us the
default-hostfile and/or hostfile cmd line options just as if you were
issuing the mpirun cmd. We just wouldn't launch - but it would be the
exact equivalent of doing "mpirun --do-not-launch".

Am I missing something? If so, please do correct me - I would be happy
to provide a tool if that would make it easier. Just not sure what
that tool would do.
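
To make the hostfile<->nodename mapping discussed in this thread concrete, here is a minimal Python sketch. It is not Open MPI code. It assumes the common hostfile layout of one host per line with optional "slots=N" and "#" comments, and it assumes, based on the discussion quoted below, that only the entry for the node running mpirun is reported under the gethostname() value while all other entries are reported unchanged. The names hostfile_entries, addresses, and dump_mapping are hypothetical.

```python
import socket


def hostfile_entries(text):
    """Yield the host name from each non-blank, non-comment hostfile line."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if line:
            yield line.split()[0]  # first token is the host; ignore slots=N etc.


def addresses(name):
    """Return the set of IP address strings that `name` resolves to."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(name, None)}
    except socket.gaierror:
        return set()  # unresolvable names simply match nothing


def dump_mapping(text, local_name=None, resolver=addresses):
    """Return (hostfile_name, reported_name) pairs for each hostfile entry.

    Assumption: an entry that resolves to the local node is reported under
    the local host name; every other entry is reported as written.
    """
    if local_name is None:
        local_name = socket.gethostname()
    local_addrs = resolver(local_name)
    pairs = []
    for name in hostfile_entries(text):
        is_local = name == local_name or bool(resolver(name) & local_addrs)
        pairs.append((name, local_name if is_local else name))
    return pairs
```

The resolver is passed in as a parameter so the mapping can be exercised with a fixed lookup table instead of live DNS. As noted above, the result is only meaningful when computed on the node where mpirun would run.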


On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:

> Ralph,
> It seems a little strange to be using mpirun for this, but barring
> providing a separate command, or using ompi_info, I think this would
> solve our problem.
> Thanks,
> Greg
> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>> Sorry for delay - had to ponder this one for awhile.
>> Jeff and I agree that adding something to ompi_info would not be a
>> good idea. Ompi_info has no knowledge or understanding of
>> hostfiles, and adding that capability to it would be a major
>> distortion of its intended use.
>> However, we think we can offer an alternative that might better
>> solve the problem. Remember, we now treat hostfiles in a very
>> different manner than before - see the wiki page for a complete
>> description, or "man orte_hosts".
>> So the problem is that, to provide you with what you want, we need
>> to "dump" the information from whatever default-hostfile was
>> provided, and, if no default-hostfile was provided, then the
>> information from each hostfile that was provided with an app_context.
>> The best way we could think of to do this is to add another mpirun
>> cmd line option --dump-hostfiles that would output the line-by-line
>> name from the hostfile plus the name we resolved it to. Of course,
>> --xml would cause it to be in xml format.
>> Would that meet your needs?
>> Ralph
>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>> Hi Ralph,
>>> We've been discussing this back and forth a bit internally and
>>> don't really see an easy solution. Our problem is that Eclipse is
>>> not running on the head node, so gethostbyname will not
>>> necessarily resolve to the same address. For example, the hostfile
>>> might refer to the head node by an internal network address that
>>> is not visible to the outside world. Since gethostbyname also looks
>>> in /etc/hosts, it may resolve locally but not on a remote system.
>>> The only thing I can think of would be, rather than us reading the
>>> hostfile directly as we do now, to provide an option to ompi_info
>>> that would dump the hostfile using the same rules that you apply
>>> when you're using the hostfile. Would that be feasible?
>>> Greg
>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>> Sorry for delay - was on vacation and am now trying to work my
>>>> way back to the surface.
>>>> I'm not sure I can fix this one for two reasons:
>>>> 1. In general, OMPI doesn't really care what name is used for the
>>>> node. However, the problem is that it needs to be consistent. In
>>>> this case, ORTE has already used the name returned by gethostname
>>>> to create its session directory structure long before mpirun
>>>> reads a hostfile. This is why we retain the value from
>>>> gethostname instead of allowing it to be overwritten by the name
>>>> in whatever allocation we are given. Using the name in hostfile
>>>> would require that I either find some way to remember any prior
>>>> name, or that I tear down and rebuild the session directory tree
>>>> - neither seems attractive nor simple (e.g., what happens when
>>>> the user provides multiple entries in the hostfile for the node,
>>>> each with a different IP address based on another interface in
>>>> that node? Sounds crazy, but we have already seen it done - which
>>>> one do I use?).
>>>> 2. We don't actually store the hostfile info anywhere - we just
>>>> use it and forget it. For us to add an XML attribute containing
>>>> any hostfile-related info would therefore require us to re-read
>>>> the hostfile. I could have it do that -only- in the case of "XML
>>>> output required", but it seems rather ugly.
>>>> An alternative might be for you to simply do a "gethostbyname"
>>>> lookup of the IP address or hostname to see if it matches instead
>>>> of just doing a strcmp. This is what we have to do internally as
>>>> we frequently have problems with FQDN vs. non-FQDN vs. IP
>>>> addresses etc. If the local OS hasn't cached the IP address for
>>>> the node in question it can take a little time to DNS resolve it,
>>>> but otherwise works fine.
>>>> I can point you to the code in OPAL that we use - I would think
>>>> something similar would be easy to implement in your code and
>>>> would readily solve the problem.
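
The lookup-based comparison suggested above (match by resolved address rather than by strcmp) can be sketched as follows. This is a hedged illustration in Python, not the OPAL code referred to in the message. The function names resolve_addresses and same_host are hypothetical, and it uses getaddrinfo, the modern counterpart of gethostbyname.

```python
import socket


def resolve_addresses(name):
    """Return the set of IP address strings that `name` resolves to."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()  # treat an unresolvable name as having no addresses
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def same_host(a, b, resolver=resolve_addresses):
    """Return True if names `a` and `b` appear to refer to the same node.

    Identical strings match immediately; otherwise the names match when
    their resolved address sets share at least one address.
    """
    if a == b:
        return True
    return bool(resolver(a) & resolver(b))
```

As the thread points out, the answer depends on where the lookup runs: a name that resolves through /etc/hosts on the head node may not resolve, or may resolve differently, on a remote machine.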
>>>> Ralph
>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>> Ralph,
>>>>> The problem we're seeing is just with the head node. If I
>>>>> specify a particular IP address for the head node in the
>>>>> hostfile, it gets changed to the FQDN when displayed in the map.
>>>>> This is a problem for us as we need to be able to match the two,
>>>>> and since we're not necessarily running on the head node, we
>>>>> can't always do the same resolution you're doing.
>>>>> Would it be possible to use the same address that is specified
>>>>> in the hostfile, or alternatively provide an XML attribute that
>>>>> contains this information?
>>>>> Thanks,
>>>>> Greg
>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>> Not in that regard, depending upon what you mean by "recently".
>>>>>> The only changes I am aware of wrt nodes consisted of some
>>>>>> changes to the order in which we use the nodes when specified
>>>>>> by hostfile or -host, and a little #if protectionism needed by
>>>>>> Brian for the Cray port.
>>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>>> offhand think of anything in the code base that would replace a
>>>>>> host name with the FQDN because we don't get that info for
>>>>>> remote nodes. The only exception is the head node (where mpirun
>>>>>> sits) - in that lone case, we default to the name returned to
>>>>>> us by gethostname(). We do that because the head node is
>>>>>> frequently accessible on a more global basis than the compute
>>>>>> nodes - thus, the FQDN is required to ensure that there is no
>>>>>> address confusion on the network.
>>>>>> If the user refers to compute nodes in a hostfile or -host (or
>>>>>> in an allocation from a resource manager) by non-FQDN, we just
>>>>>> assume they know what they are doing and the name will
>>>>>> correctly resolve to a unique address.
>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>> Hi,
>>>>>>> Has the behavior of the -display-map option changed recently
>>>>>>> in the 1.3 branch? We're now seeing the host name as a fully
>>>>>>> resolved domain name rather than the entry that was specified
>>>>>>> in the hostfile. Is there any
>>>>>>> particular reason for this? If so, would it be possible to add
>>>>>>> the hostfile entry to the output since we need to be able to
>>>>>>> match the two?
>>>>>>> Thanks,
>>>>>>> Greg
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]