Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-12-08 14:05:23


Working its way around the CMR process now.

Might be easier in the future if we could test/debug this in the
trunk, though. Otherwise, the CMR procedure will fall behind and a fix
might miss a release window.

Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:

> Hi Ralph,
>
> This is now in 1.3rc2, thanks. However there are a couple of
> problems. Here is what I see:
>
> [Jarrah.watson.ibm.com:58957] <noderesolve name="node0"
> resolved="Jarrah.watson.ibm.com">
>
> For some reason each line is prefixed with "[...]". Any idea why
> this is? Also, the tag should close with "/>", not ">".
>
> Thanks,
>
> Greg
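
A consumer of this output could tolerate both quirks reported above while the fix works its way through. The following is a minimal sketch, not code from Open MPI or from Greg's tool: it assumes each noderesolve element arrives on a single line with the attributes in the order shown in the sample, and the names NODERESOLVE and parse_noderesolve are hypothetical.

```python
import re

# Assumption: one element per line, attributes ordered name then resolved.
# The pattern accepts an optional "[host:pid] " prefix and a tag that
# closes with either ">" or "/>".
NODERESOLVE = re.compile(
    r'^(?:\[[^\]]*\]\s*)?'
    r'<noderesolve\s+name="([^"]*)"\s+resolved="([^"]*)"\s*/?>\s*$'
)


def parse_noderesolve(lines):
    """Map each hostfile name to the name it was reported as resolved to."""
    mapping = {}
    for line in lines:
        match = NODERESOLVE.match(line.strip())
        if match:
            name, resolved = match.groups()
            mapping[name] = resolved
    return mapping
```

Lines that do not match the pattern are skipped, so the function can be fed the whole output stream rather than pre-filtered lines.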
>
> On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:
>
>> Great, thanks. I'll take a look once it comes over to 1.3.
>>
>> Cheers,
>>
>> Greg
>>
>> On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:
>>
>>> Yo Greg
>>>
>>> This is in the trunk as of r20032. I'll bring it over to 1.3 in a
>>> few days.
>>>
>>> I implemented it as another MCA param
>>> "orte_show_resolved_nodenames" so you can actually get the info as
>>> you execute the job, if you want. The xml tag is "noderesolve" -
>>> let me know if you need any changes.
>>>
>>> Ralph
>>>
>>>
>>> On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
>>>
>>>> Ralph,
>>>>
>>>> I guess the issue for us is that we will have to run two commands
>>>> to get the information we need. One to get the configuration
>>>> information, such as version and MCA parameters, and one to get
>>>> the host information, whereas it would seem more logical that
>>>> this should all be available via some kind of "configuration
>>>> discovery" command. I understand the issue with supplying the
>>>> hostfile though, so maybe this just points at the need for us to
>>>> separate configuration information from the host information. In
>>>> any case, we'll work with what you think is best.
>>>>
>>>> Greg
>>>>
>>>> On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
>>>>
>>>>> Hmmm...just to be sure we are all clear on this. The reason we
>>>>> proposed to use mpirun is that "hostfile" has no meaning outside
>>>>> of mpirun. That's why ompi_info can't do anything in this regard.
>>>>>
>>>>> We have no idea what hostfile the user may specify until we
>>>>> actually get the mpirun cmd line. They may have specified a
>>>>> default-hostfile, but they could also specify hostfiles for the
>>>>> individual app_contexts. These may or may not include the node
>>>>> upon which mpirun is executing.
>>>>>
>>>>> So the only way to provide you with a separate command to get a
>>>>> hostfile<->nodename mapping would require you to provide us with
>>>>> the default-hostfile and/or hostfile cmd line options just as if
>>>>> you were issuing the mpirun cmd. We just wouldn't launch - but
>>>>> it would be the exact equivalent of doing
>>>>> "mpirun --do-not-launch".
>>>>>
>>>>> Am I missing something? If so, please do correct me - I would be
>>>>> happy to provide a tool if that would make it easier. Just not
>>>>> sure what that tool would do.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> It seems a little strange to be using mpirun for this, but
>>>>>> barring providing a separate command, or using ompi_info, I
>>>>>> think this would solve our problem.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Sorry for delay - had to ponder this one for a while.
>>>>>>>
>>>>>>> Jeff and I agree that adding something to ompi_info would not
>>>>>>> be a good idea. ompi_info has no knowledge or understanding of
>>>>>>> hostfiles, and adding that capability to it would be a major
>>>>>>> distortion of its intended use.
>>>>>>>
>>>>>>> However, we think we can offer an alternative that might
>>>>>>> better solve the problem. Remember, we now treat hostfiles in
>>>>>>> a very different manner than before - see the wiki page for a
>>>>>>> complete description, or "man orte_hosts".
>>>>>>>
>>>>>>> So the problem is that, to provide you with what you want, we
>>>>>>> need to "dump" the information from whatever default-hostfile
>>>>>>> was provided, and, if no default-hostfile was provided, then
>>>>>>> the information from each hostfile that was provided with an
>>>>>>> app_context.
>>>>>>>
>>>>>>> The best way we could think of to do this is to add another
>>>>>>> mpirun cmd line option --dump-hostfiles that would output the
>>>>>>> line-by-line name from the hostfile plus the name we resolved
>>>>>>> it to. Of course, --xml would cause it to be in xml format.
>>>>>>>
>>>>>>> Would that meet your needs?
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>>>>>>
>>>>>>>> Hi Ralph,
>>>>>>>>
>>>>>>>> We've been discussing this back and forth a bit internally
>>>>>>>> and don't really see an easy solution. Our problem is that
>>>>>>>> Eclipse is not running on the head node, so gethostbyname
>>>>>>>> will not necessarily resolve to the same address. For
>>>>>>>> example, the hostfile might refer to the head node by an
>>>>>>>> internal network address that is not visible to the outside
>>>>>>>> world. Since gethostbyname also looks in /etc/hosts, it may
>>>>>>>> resolve locally but not on a remote system. The only thing I
>>>>>>>> can think of would be, rather than us reading the hostfile
>>>>>>>> directly as we do now, to provide an option to ompi_info that
>>>>>>>> would dump the hostfile using the same rules that you apply
>>>>>>>> when you're using the hostfile. Would that be feasible?
>>>>>>>>
>>>>>>>> Greg
>>>>>>>>
>>>>>>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Sorry for delay - was on vacation and am now trying to work
>>>>>>>>> my way back to the surface.
>>>>>>>>>
>>>>>>>>> I'm not sure I can fix this one for two reasons:
>>>>>>>>>
>>>>>>>>> 1. In general, OMPI doesn't really care what name is used
>>>>>>>>> for the node. However, the problem is that it needs to be
>>>>>>>>> consistent. In this case, ORTE has already used the name
>>>>>>>>> returned by gethostname to create its session directory
>>>>>>>>> structure long before mpirun reads a hostfile. This is why
>>>>>>>>> we retain the value from gethostname instead of allowing it
>>>>>>>>> to be overwritten by the name in whatever allocation we are
>>>>>>>>> given. Using the name in hostfile would require that I
>>>>>>>>> either find some way to remember any prior name, or that I
>>>>>>>>> tear down and rebuild the session directory tree - neither
>>>>>>>>> seems attractive nor simple (e.g., what happens when the
>>>>>>>>> user provides multiple entries in the hostfile for the node,
>>>>>>>>> each with a different IP address based on another interface
>>>>>>>>> in that node? Sounds crazy, but we have already seen it done
>>>>>>>>> - which one do I use?).
>>>>>>>>>
>>>>>>>>> 2. We don't actually store the hostfile info anywhere - we
>>>>>>>>> just use it and forget it. For us to add an XML attribute
>>>>>>>>> containing any hostfile-related info would therefore require
>>>>>>>>> us to re-read the hostfile. I could have it do that -only-
>>>>>>>>> in the case of "XML output required", but it seems rather
>>>>>>>>> ugly.
>>>>>>>>>
>>>>>>>>> An alternative might be for you to simply do a
>>>>>>>>> "gethostbyname" lookup of the IP address or hostname to see
>>>>>>>>> if it matches instead of just doing a strcmp. This is what
>>>>>>>>> we have to do internally as we frequently have problems with
>>>>>>>>> FQDN vs. non-FQDN vs. IP addresses etc. If the local OS
>>>>>>>>> hasn't cached the IP address for the node in question it can
>>>>>>>>> take a little time to DNS resolve it, but otherwise works
>>>>>>>>> fine.
>>>>>>>>>
>>>>>>>>> I can point you to the code in OPAL that we use - I would
>>>>>>>>> think something similar would be easy to implement in your
>>>>>>>>> code and would readily solve the problem.
>>>>>>>>>
>>>>>>>>> Ralph
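
The lookup-instead-of-strcmp idea above can be sketched as follows. This is only an illustration under stated assumptions, not the OPAL code Ralph refers to: it uses Python's socket.getaddrinfo as a stand-in for gethostbyname, and the helper names resolve_addresses and same_host are hypothetical.

```python
import socket


def resolve_addresses(host):
    """Return the set of IP addresses the local resolver gives for host."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # Unresolvable names yield no addresses rather than an error.
        return set()
    return {info[4][0] for info in infos}


def same_host(a, b, resolve=resolve_addresses):
    """True if the names are equal or share at least one resolved address."""
    if a.lower() == b.lower():
        return True
    return bool(resolve(a) & resolve(b))
```

The resolver is passed in as a parameter so the comparison can be exercised without network access. The answer still depends on the machine doing the lookup, so a name that resolves on the head node may not resolve the same way elsewhere.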
>>>>>>>>>
>>>>>>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>>>>>>
>>>>>>>>>> Ralph,
>>>>>>>>>>
>>>>>>>>>> The problem we're seeing is just with the head node. If I
>>>>>>>>>> specify a particular IP address for the head node in the
>>>>>>>>>> hostfile, it gets changed to the FQDN when displayed in the
>>>>>>>>>> map. This is a problem for us as we need to be able to
>>>>>>>>>> match the two, and since we're not necessarily running on
>>>>>>>>>> the head node, we can't always do the same resolution
>>>>>>>>>> you're doing.
>>>>>>>>>>
>>>>>>>>>> Would it be possible to use the same address that is
>>>>>>>>>> specified in the hostfile, or alternatively provide an XML
>>>>>>>>>> attribute that contains this information?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Greg
>>>>>>>>>>
>>>>>>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Not in that regard, depending upon what you mean by
>>>>>>>>>>> "recently". The only changes I am aware of wrt nodes
>>>>>>>>>>> consisted of some changes to the order in which we use the
>>>>>>>>>>> nodes when specified by hostfile or -host, and a little
>>>>>>>>>>> #if protectionism needed by Brian for the Cray port.
>>>>>>>>>>>
>>>>>>>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>>>>>>>> offhand think of anything in the code base that would
>>>>>>>>>>> replace a host name with the FQDN because we don't get
>>>>>>>>>>> that info for remote nodes. The only exception is the head
>>>>>>>>>>> node (where mpirun sits) - in that lone case, we default
>>>>>>>>>>> to the name returned to us by gethostname(). We do that
>>>>>>>>>>> because the head node is frequently accessible on a more
>>>>>>>>>>> global basis than the compute nodes - thus, the FQDN is
>>>>>>>>>>> required to ensure that there is no address confusion on
>>>>>>>>>>> the network.
>>>>>>>>>>>
>>>>>>>>>>> If the user refers to compute nodes in a hostfile or -host
>>>>>>>>>>> (or in an allocation from a resource manager) by non-FQDN,
>>>>>>>>>>> we just assume they know what they are doing and the name
>>>>>>>>>>> will correctly resolve to a unique address.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Has the behavior of the -display-map option changed
>>>>>>>>>>>> recently in the 1.3 branch? We're now seeing the host name
>>>>>>>>>>>> as a fully qualified domain name (FQDN) rather than the
>>>>>>>>>>>> entry that was specified in the hostfile. Is
>>>>>>>>>>>> there any particular reason for this? If so, would it be
>>>>>>>>>>>> possible to add the hostfile entry to the output since we
>>>>>>>>>>>> need to be able to match the two?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Greg
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel