Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-11-24 14:59:15


Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3 in a few
days.

I implemented it as another MCA param "orte_show_resolved_nodenames"
so you can actually get the info as you execute the job, if you want.
The xml tag is "noderesolve" - let me know if you need any changes.
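
As a rough illustration of how a front end might request this output, here is a minimal Python sketch that only assembles an mpirun command line and does not run it. The MCA parameter name, -display-map, --xml and --do-not-launch come from this thread, and -mca, --hostfile and -np are ordinary mpirun options. The helper name build_display_map_cmd, the hostfile path, and the assumption that these options are combined this way are illustrative, so check them against your installed version.

```python
import shlex


def build_display_map_cmd(hostfile, program, nprocs=1, xml=True, launch=False):
    """Assemble (but do not run) an mpirun command line as an argument list.

    Hypothetical helper: the option combination is an assumption, not
    something confirmed by the thread.
    """
    cmd = [
        "mpirun",
        "-mca", "orte_show_resolved_nodenames", "1",  # MCA param named in the thread
        "-display-map",
    ]
    if xml:
        cmd.append("--xml")            # ask for XML-formatted output
    if not launch:
        cmd.append("--do-not-launch")  # map the job without starting it
    cmd += ["--hostfile", hostfile, "-np", str(nprocs), program]
    return cmd


if __name__ == "__main__":
    # "my_hostfile" and "./a.out" are placeholder names.
    print(shlex.join(build_display_map_cmd("my_hostfile", "./a.out", nprocs=4)))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to something like subprocess.run.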

Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:

> Ralph,
>
> I guess the issue for us is that we will have to run two commands to
> get the information we need. One to get the configuration
> information, such as version and MCA parameters, and one to get the
> host information, whereas it would seem more logical that this
> should all be available via some kind of "configuration discovery"
> command. I understand the issue with supplying the hostfile though,
> so maybe this just points at the need for us to separate
> configuration information from the host information. In any case,
> we'll work with what you think is best.
>
> Greg
>
> On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
>
>> Hmmm...just to be sure we are all clear on this. The reason we
>> proposed to use mpirun is that "hostfile" has no meaning outside of
>> mpirun. That's why ompi_info can't do anything in this regard.
>>
>> We have no idea what hostfile the user may specify until we
>> actually get the mpirun cmd line. They may have specified a default-
>> hostfile, but they could also specify hostfiles for the individual
>> app_contexts. These may or may not include the node upon which
>> mpirun is executing.
>>
>> So the only way to provide you with a separate command to get a
>> hostfile<->nodename mapping would require you to provide us with
>> the default-hostfile and/or hostfile cmd line options just as if
>> you were issuing the mpirun cmd. We just wouldn't launch - but it
>> would be the exact equivalent of doing "mpirun --do-not-launch".
>>
>> Am I missing something? If so, please do correct me - I would be
>> happy to provide a tool if that would make it easier. Just not sure
>> what that tool would do.
>>
>> Thanks
>> Ralph
>>
>>
>> On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
>>
>>> Ralph,
>>>
>>> It seems a little strange to be using mpirun for this, but barring
>>> providing a separate command, or using ompi_info, I think this
>>> would solve our problem.
>>>
>>> Thanks,
>>>
>>> Greg
>>>
>>> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>>>
>>>> Sorry for delay - had to ponder this one for a while.
>>>>
>>>> Jeff and I agree that adding something to ompi_info would not be
>>>> a good idea. Ompi_info has no knowledge or understanding of
>>>> hostfiles, and adding that capability to it would be a major
>>>> distortion of its intended use.
>>>>
>>>> However, we think we can offer an alternative that might better
>>>> solve the problem. Remember, we now treat hostfiles in a very
>>>> different manner than before - see the wiki page for a complete
>>>> description, or "man orte_hosts".
>>>>
>>>> So the problem is that, to provide you with what you want, we
>>>> need to "dump" the information from whatever default-hostfile was
>>>> provided, and, if no default-hostfile was provided, then the
>>>> information from each hostfile that was provided with an
>>>> app_context.
>>>>
>>>> The best way we could think of to do this is to add another
>>>> mpirun cmd line option --dump-hostfiles that would output the
>>>> line-by-line name from the hostfile plus the name we resolved it
>>>> to. Of course, --xml would cause it to be in xml format.
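
The kind of output proposed here (each hostfile entry next to what it resolves to) can be sketched in Python as follows. This is only an illustration under stated assumptions: it handles the common hostfile layout of one host per line with optional trailing fields such as slots=N and # comments, it resolves names to an IP address with the system resolver rather than applying Open MPI's own rules, and the function names parse_hostfile and dump_hostfile are hypothetical.

```python
import socket


def parse_hostfile(text):
    """Yield the host name from each non-empty, non-comment hostfile line."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            yield line.split()[0]            # first token is the host; ignore slots=N etc.


def dump_hostfile(text, resolve=socket.gethostbyname):
    """Return (name_as_written, resolved_address_or_None) pairs.

    `resolve` is injectable so the function can be exercised without DNS.
    """
    pairs = []
    for name in parse_hostfile(text):
        try:
            resolved = resolve(name)
        except OSError:                      # socket.gaierror is a subclass of OSError
            resolved = None
        pairs.append((name, resolved))
    return pairs


if __name__ == "__main__":
    sample = "# example hostfile\nlocalhost slots=2\n"
    for written, resolved in dump_hostfile(sample):
        print(f"{written} -> {resolved}")
```

The resolved value depends on the machine where the lookup runs, which is why the thread proposes having mpirun itself produce the mapping.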
>>>>
>>>> Would that meet your needs?
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> We've been discussing this back and forth a bit internally and
>>>>> don't really see an easy solution. Our problem is that Eclipse
>>>>> is not running on the head node, so gethostbyname will not
>>>>> necessarily resolve to the same address. For example, the
>>>>> hostfile might refer to the head node by an internal network
>>>>> address that is not visible to the outside world. Since
>>>>> gethostbyname also looks in /etc/hosts, it may resolve locally but
>>>>> not on a remote system. The only thing I can think of would be,
>>>>> rather than us reading the hostfile directly as we do now, to
>>>>> provide an option to ompi_info that would dump the hostfile
>>>>> using the same rules that you apply when you're using the
>>>>> hostfile. Would that be feasible?
>>>>>
>>>>> Greg
>>>>>
>>>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>>>
>>>>>> Sorry for delay - was on vacation and am now trying to work my
>>>>>> way back to the surface.
>>>>>>
>>>>>> I'm not sure I can fix this one for two reasons:
>>>>>>
>>>>>> 1. In general, OMPI doesn't really care what name is used for
>>>>>> the node. However, the problem is that it needs to be
>>>>>> consistent. In this case, ORTE has already used the name
>>>>>> returned by gethostname to create its session directory
>>>>>> structure long before mpirun reads a hostfile. This is why we
>>>>>> retain the value from gethostname instead of allowing it to be
>>>>>> overwritten by the name in whatever allocation we are given.
>>>>>> Using the name in hostfile would require that I either find
>>>>>> some way to remember any prior name, or that I tear down and
>>>>>> rebuild the session directory tree - neither seems attractive
>>>>>> nor simple (e.g., what happens when the user provides multiple
>>>>>> entries in the hostfile for the node, each with a different IP
>>>>>> address based on another interface in that node? Sounds crazy,
>>>>>> but we have already seen it done - which one do I use?).
>>>>>>
>>>>>> 2. We don't actually store the hostfile info anywhere - we just
>>>>>> use it and forget it. For us to add an XML attribute containing
>>>>>> any hostfile-related info would therefore require us to re-read
>>>>>> the hostfile. I could have it do that -only- in the case of
>>>>>> "XML output required", but it seems rather ugly.
>>>>>>
>>>>>> An alternative might be for you to simply do a "gethostbyname"
>>>>>> lookup of the IP address or hostname to see if it matches
>>>>>> instead of just doing a strcmp. This is what we have to do
>>>>>> internally as we frequently have problems with FQDN vs. non-
>>>>>> FQDN vs. IP addresses etc. If the local OS hasn't cached the IP
>>>>>> address for the node in question it can take a little time to
>>>>>> DNS resolve it, but otherwise works fine.
>>>>>>
>>>>>> I can point you to the code in OPAL that we use - I would think
>>>>>> something similar would be easy to implement in your code and
>>>>>> would readily solve the problem.
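
The lookup-based comparison suggested above, as opposed to a plain string compare, might look like the following Python sketch. It is not the OPAL code referred to in the message. It uses socket.getaddrinfo in place of gethostbyname, treats two names as the same node when their resolved address sets overlap, and the function names addresses and same_node are hypothetical.

```python
import socket


def addresses(name, resolve=socket.getaddrinfo):
    """Return the set of IP address strings `name` resolves to (empty on failure)."""
    try:
        infos = resolve(name, None)
    except OSError:                       # includes socket.gaierror
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def same_node(a, b, resolve=socket.getaddrinfo):
    """True if the two names are identical or share at least one resolved address."""
    if a == b:                            # cheap string match first
        return True
    return bool(addresses(a, resolve) & addresses(b, resolve))


if __name__ == "__main__":
    print(same_node("localhost", "localhost"))
```

The answer depends on the resolver of the machine that runs the check, so a host outside the cluster may not see the same addresses as the head node.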
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> The problem we're seeing is just with the head node. If I
>>>>>>> specify a particular IP address for the head node in the
>>>>>>> hostfile, it gets changed to the FQDN when displayed in the
>>>>>>> map. This is a problem for us as we need to be able to match
>>>>>>> the two, and since we're not necessarily running on the head
>>>>>>> node, we can't always do the same resolution you're doing.
>>>>>>>
>>>>>>> Would it be possible to use the same address that is specified
>>>>>>> in the hostfile, or alternatively provide an XML attribute
>>>>>>> that contains this information?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Greg
>>>>>>>
>>>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> Not in that regard, depending upon what you mean by
>>>>>>>> "recently". The only changes I am aware of wrt nodes
>>>>>>>> consisted of some changes to the order in which we use the
>>>>>>>> nodes when specified by hostfile or -host, and a little #if
>>>>>>>> protectionism needed by Brian for the Cray port.
>>>>>>>>
>>>>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>>>>> offhand think of anything in the code base that would replace
>>>>>>>> a host name with the FQDN because we don't get that info for
>>>>>>>> remote nodes. The only exception is the head node (where
>>>>>>>> mpirun sits) - in that lone case, we default to the name
>>>>>>>> returned to us by gethostname(). We do that because the head
>>>>>>>> node is frequently accessible on a more global basis than the
>>>>>>>> compute nodes - thus, the FQDN is required to ensure that
>>>>>>>> there is no address confusion on the network.
>>>>>>>>
>>>>>>>> If the user refers to compute nodes in a hostfile or -host
>>>>>>>> (or in an allocation from a resource manager) by non-FQDN, we
>>>>>>>> just assume they know what they are doing and the name will
>>>>>>>> correctly resolve to a unique address.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Has the behavior of the -display-map option changed
>>>>>>>>> recently in the 1.3 branch? We're now
>>>>>>>>> seeing the host name as a fully resolved DN rather than the
>>>>>>>>> entry that was specified in the hostfile. Is there any
>>>>>>>>> particular reason for this? If so, would it be possible to
>>>>>>>>> add the hostfile entry to the output since we need to be
>>>>>>>>> able to match the two?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Greg
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> devel_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel