Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Greg Watson (g.watson_at_[hidden])
Date: 2008-11-24 15:06:37

Great, thanks. I'll take a look once it comes over to 1.3.



On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:

> Yo Greg
> This is in the trunk as of r20032. I'll bring it over to 1.3 in a
> few days.
> I implemented it as another MCA param "orte_show_resolved_nodenames"
> so you can actually get the info as you execute the job, if you
> want. The xml tag is "noderesolve" - let me know if you need any
> changes.
> Ralph
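A tool consuming the XML output could pick out the "noderesolve" elements Ralph mentions roughly as follows. This is only a sketch: the thread names the tag but not its attributes or where it sits in the output, so the function assumes nothing beyond the tag name and returns whatever attributes each element carries. It also assumes the text handed to it is well-formed XML with a single root element; `resolved_entries` is a name made up for this example.

```python
import xml.etree.ElementTree as ET


def resolved_entries(xml_text):
    """Return the attributes of every <noderesolve> element as dicts.

    Attribute names are not given in the thread, so nothing is assumed
    about them here; each dict holds whatever the element carries.
    """
    root = ET.fromstring(xml_text)  # assumes a single well-formed root
    return [dict(elem.attrib) for elem in root.iter("noderesolve")]
```

The caller can then match the hostfile entry against the returned values once the actual attribute layout is known.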
> On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
>> Ralph,
>> I guess the issue for us is that we will have to run two commands
>> to get the information we need. One to get the configuration
>> information, such as version and MCA parameters, and one to get the
>> host information, whereas it would seem more logical that this
>> should all be available via some kind of "configuration discovery"
>> command. I understand the issue with supplying the hostfile though,
>> so maybe this just points at the need for us to separate
>> configuration information from the host information. In any case,
>> we'll work with what you think is best.
>> Greg
>> On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
>>> Hmmm...just to be sure we are all clear on this. The reason we
>>> proposed to use mpirun is that "hostfile" has no meaning outside
>>> of mpirun. That's why ompi_info can't do anything in this regard.
>>> We have no idea what hostfile the user may specify until we
>>> actually get the mpirun cmd line. They may have specified a
>>> default-hostfile, but they could also specify hostfiles for the
>>> individual app_contexts. These may or may not include the node
>>> upon which mpirun is executing.
>>> So the only way to provide you with a separate command to get a
>>> hostfile<->nodename mapping would require you to provide us with
>>> the default-hostfile and/or hostfile cmd line options just as if
>>> you were issuing the mpirun cmd. We just wouldn't launch - but it
>>> would be the exact equivalent of doing "mpirun --do-not-launch".
>>> Am I missing something? If so, please do correct me - I would be
>>> happy to provide a tool if that would make it easier. Just not
>>> sure what that tool would do.
>>> Thanks
>>> Ralph
>>> On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
>>>> Ralph,
>>>> It seems a little strange to be using mpirun for this, but
>>>> barring providing a separate command, or using ompi_info, I think
>>>> this would solve our problem.
>>>> Thanks,
>>>> Greg
>>>> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>>>>> Sorry for delay - had to ponder this one for a while.
>>>>> Jeff and I agree that adding something to ompi_info would not be
>>>>> a good idea. Ompi_info has no knowledge or understanding of
>>>>> hostfiles, and adding that capability to it would be a major
>>>>> distortion of its intended use.
>>>>> However, we think we can offer an alternative that might better
>>>>> solve the problem. Remember, we now treat hostfiles in a very
>>>>> different manner than before - see the wiki page for a complete
>>>>> description, or "man orte_hosts".
>>>>> So the problem is that, to provide you with what you want, we
>>>>> need to "dump" the information from whatever default-hostfile
>>>>> was provided, and, if no default-hostfile was provided, then the
>>>>> information from each hostfile that was provided with an
>>>>> app_context.
>>>>> The best way we could think of to do this is to add another
>>>>> mpirun cmd line option --dump-hostfiles that would output the
>>>>> line-by-line name from the hostfile plus the name we resolved it
>>>>> to. Of course, --xml would cause it to be in xml format.
>>>>> Would that meet your needs?
>>>>> Ralph
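The proposed dump (each hostfile entry next to the name it resolves to) can be pictured with a small sketch. It is not Open MPI code. It assumes the usual hostfile layout, where the node name is the first token on a line, options such as `slots=` follow it, and `#` starts a comment. The function names and the choice of `socket.gethostbyname_ex` as the default resolver are assumptions made for illustration; mpirun's own resolution rules may differ.

```python
import socket


def parse_hostfile(text):
    """Yield the node name from each non-empty, non-comment hostfile line."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            yield line.split()[0]  # first token is the node name


def default_resolve(name):
    """Return the canonical name the local resolver reports, or None."""
    try:
        return socket.gethostbyname_ex(name)[0]
    except OSError:  # covers socket.gaierror and socket.herror
        return None


def dump_hostfile(text, resolve=default_resolve):
    """Pair every hostfile entry with the name it resolves to."""
    return [(name, resolve(name)) for name in parse_hostfile(text)]
```

Passing the resolver in as a parameter keeps the pairing logic separate from the lookup, which matters here because the lookup only gives the intended answer when it runs on the node where mpirun would run.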
>>>>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>>>>> Hi Ralph,
>>>>>> We've been discussing this back and forth a bit internally and
>>>>>> don't really see an easy solution. Our problem is that Eclipse
>>>>>> is not running on the head node, so gethostbyname will not
>>>>>> necessarily resolve to the same address. For example, the
>>>>>> hostfile might refer to the head node by an internal network
>>>>>> address that is not visible to the outside world. Since
>>>>>> gethostbyname also looks in /etc/hosts, it may resolve locally
>>>>>> but not on a remote system. The only thing I can think of would
>>>>>> be, rather than us reading the hostfile directly as we do now,
>>>>>> to provide an option to ompi_info that would dump the hostfile
>>>>>> using the same rules that you apply when you're using the
>>>>>> hostfile. Would that be feasible?
>>>>>> Greg
>>>>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>>>>> Sorry for delay - was on vacation and am now trying to work my
>>>>>>> way back to the surface.
>>>>>>> I'm not sure I can fix this one for two reasons:
>>>>>>> 1. In general, OMPI doesn't really care what name is used for
>>>>>>> the node. However, the problem is that it needs to be
>>>>>>> consistent. In this case, ORTE has already used the name
>>>>>>> returned by gethostname to create its session directory
>>>>>>> structure long before mpirun reads a hostfile. This is why we
>>>>>>> retain the value from gethostname instead of allowing it to be
>>>>>>> overwritten by the name in whatever allocation we are given.
>>>>>>> Using the name in hostfile would require that I either find
>>>>>>> some way to remember any prior name, or that I tear down and
>>>>>>> rebuild the session directory tree - neither seems attractive
>>>>>>> nor simple (e.g., what happens when the user provides multiple
>>>>>>> entries in the hostfile for the node, each with a different IP
>>>>>>> address based on another interface in that node? Sounds crazy,
>>>>>>> but we have already seen it done - which one do I use?).
>>>>>>> 2. We don't actually store the hostfile info anywhere - we
>>>>>>> just use it and forget it. For us to add an XML attribute
>>>>>>> containing any hostfile-related info would therefore require
>>>>>>> us to re-read the hostfile. I could have it do that -only- in
>>>>>>> the case of "XML output required", but it seems rather ugly.
>>>>>>> An alternative might be for you to simply do a "gethostbyname"
>>>>>>> lookup of the IP address or hostname to see if it matches
>>>>>>> instead of just doing a strcmp. This is what we have to do
>>>>>>> internally as we frequently have problems with FQDN vs.
>>>>>>> non-FQDN vs. IP addresses etc. If the local OS hasn't cached the
>>>>>>> IP address for the node in question it can take a little time
>>>>>>> to DNS resolve it, but otherwise works fine.
>>>>>>> I can point you to the code in OPAL that we use - I would
>>>>>>> think something similar would be easy to implement in your
>>>>>>> code and would readily solve the problem.
>>>>>>> Ralph
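A minimal sketch of the lookup-based comparison suggested above, as opposed to a plain string compare. It is not the OPAL code Ralph refers to. It uses Python's `socket.getaddrinfo`, and the helper names `resolve_addresses` and `same_host` are invented for this example. Two names are treated as the same node when they share at least one resolved address, which is one possible matching rule among several.

```python
import socket


def resolve_addresses(name, resolver=socket.getaddrinfo):
    """Return the set of IP address strings that `name` resolves to.

    `name` may be a short host name, an FQDN, or a literal IP address.
    An empty set means the name could not be resolved on this machine.
    """
    try:
        infos = resolver(name, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def same_host(a, b, resolver=socket.getaddrinfo):
    """Compare two node names by resolved address instead of by string."""
    if a == b:
        return True
    return bool(resolve_addresses(a, resolver) & resolve_addresses(b, resolver))
```

The result depends on the resolver of the machine where the comparison runs, so a name that is only known on the cluster's internal network will not match when checked from outside it.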
>>>>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>>>>> Ralph,
>>>>>>>> The problem we're seeing is just with the head node. If I
>>>>>>>> specify a particular IP address for the head node in the
>>>>>>>> hostfile, it gets changed to the FQDN when displayed in the
>>>>>>>> map. This is a problem for us as we need to be able to match
>>>>>>>> the two, and since we're not necessarily running on the head
>>>>>>>> node, we can't always do the same resolution you're doing.
>>>>>>>> Would it be possible to use the same address that is
>>>>>>>> specified in the hostfile, or alternatively provide an XML
>>>>>>>> attribute that contains this information?
>>>>>>>> Thanks,
>>>>>>>> Greg
>>>>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>>>>> Not in that regard, depending upon what you mean by
>>>>>>>>> "recently". The only changes I am aware of wrt nodes
>>>>>>>>> consisted of some changes to the order in which we use the
>>>>>>>>> nodes when specified by hostfile or -host, and a little #if
>>>>>>>>> protectionism needed by Brian for the Cray port.
>>>>>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>>>>>> offhand think of anything in the code base that would
>>>>>>>>> replace a host name with the FQDN because we don't get that
>>>>>>>>> info for remote nodes. The only exception is the head node
>>>>>>>>> (where mpirun sits) - in that lone case, we default to the
>>>>>>>>> name returned to us by gethostname(). We do that because the
>>>>>>>>> head node is frequently accessible on a more global basis
>>>>>>>>> than the compute nodes - thus, the FQDN is required to
>>>>>>>>> ensure that there is no address confusion on the network.
>>>>>>>>> If the user refers to compute nodes in a hostfile or -host
>>>>>>>>> (or in an allocation from a resource manager) by non-FQDN,
>>>>>>>>> we just assume they know what they are doing and the name
>>>>>>>>> will correctly resolve to a unique address.
>>>>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> Has the behavior of the -display-map option changed recently in
>>>>>>>>>> the 1.3 branch? We're now seeing the host name as an FQDN
>>>>>>>>>> rather than the entry that was specified in the hostfile.
>>>>>>>>>> Is there any
>>>>>>>>>> particular reason for this? If so, would it be possible to
>>>>>>>>>> add the hostfile entry to the output since we need to be
>>>>>>>>>> able to match the two?
>>>>>>>>>> Thanks,
>>>>>>>>>> Greg
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> devel_at_[hidden]