Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Greg Watson (g.watson_at_[hidden])
Date: 2008-12-08 14:18:49


Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:

> Working its way around the CMR process now.
>
> Might be easier in the future if we could test/debug this in the
> trunk, though. Otherwise, the CMR procedure will fall behind and a
> fix might miss a release window.
>
> Anyway, hopefully this one will make the 1.3.0 release cutoff.
>
> Thanks
> Ralph
>
> On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:
>
>> Hi Ralph,
>>
>> This is now in 1.3rc2, thanks. However, there are a couple of
>> problems. Here is what I see:
>>
>> [Jarrah.watson.ibm.com:58957] <noderesolve name="node0"
>> resolved="Jarrah.watson.ibm.com">
>>
>> For some reason each line is prefixed with "[...]". Any idea why
>> this is? Also, the tag should be closed with "/>", not ">".
>>
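A minimal sketch of how a consumer of this output might cope with both problems described above, assuming each noderesolve element arrives on a single line. The helper name parse_noderesolve is hypothetical and is not part of Open MPI; it strips an optional "[host:pid]" prefix and tolerates the unterminated ">" form before parsing.

```python
import re
import xml.etree.ElementTree as ET

# Matches an optional leading "[host:pid] " style prefix (assumed format).
_PREFIX = re.compile(r"^\[[^\]]*\]\s*")


def parse_noderesolve(line):
    """Return (name, resolved) from one noderesolve line.

    Hypothetical helper: accepts the line with or without a "[...]" prefix,
    and with either "/>" or a bare ">" at the end of the tag.
    """
    text = _PREFIX.sub("", line.strip())
    # Close the element ourselves if the output ended it with a bare ">".
    if text.endswith(">") and not text.endswith("/>"):
        text = text[:-1] + "/>"
    elem = ET.fromstring(text)
    if elem.tag != "noderesolve":
        raise ValueError("unexpected element: " + elem.tag)
    return elem.get("name"), elem.get("resolved")


if __name__ == "__main__":
    sample = ('[Jarrah.watson.ibm.com:58957] <noderesolve name="node0" '
              'resolved="Jarrah.watson.ibm.com">')
    print(parse_noderesolve(sample))
```

Once the output is well-formed and unprefixed, both workarounds become no-ops and the same function still applies.
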
>> Thanks,
>>
>> Greg
>>
>> On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:
>>
>>> Great, thanks. I'll take a look once it comes over to 1.3.
>>>
>>> Cheers,
>>>
>>> Greg
>>>
>>> On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:
>>>
>>>> Yo Greg
>>>>
>>>> This is in the trunk as of r20032. I'll bring it over to 1.3 in a
>>>> few days.
>>>>
>>>> I implemented it as another MCA param
>>>> "orte_show_resolved_nodenames" so you can actually get the info
>>>> as you execute the job, if you want. The xml tag is "noderesolve"
>>>> - let me know if you need any changes.
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> I guess the issue for us is that we will have to run two
>>>>> commands to get the information we need. One to get the
>>>>> configuration information, such as version and MCA parameters,
>>>>> and one to get the host information, whereas it would seem more
>>>>> logical that this should all be available via some kind of
>>>>> "configuration discovery" command. I understand the issue with
>>>>> supplying the hostfile though, so maybe this just points at the
>>>>> need for us to separate configuration information from the host
>>>>> information. In any case, we'll work with what you think is best.
>>>>>
>>>>> Greg
>>>>>
>>>>> On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
>>>>>
>>>>>> Hmmm...just to be sure we are all clear on this. The reason we
>>>>>> proposed to use mpirun is that "hostfile" has no meaning
>>>>>> outside of mpirun. That's why ompi_info can't do anything in
>>>>>> this regard.
>>>>>>
>>>>>> We have no idea what hostfile the user may specify until we
>>>>>> actually get the mpirun cmd line. They may have specified a
>>>>>> default-hostfile, but they could also specify hostfiles for the
>>>>>> individual app_contexts. These may or may not include the node
>>>>>> upon which mpirun is executing.
>>>>>>
>>>>>> So the only way to provide you with a separate command to get a
>>>>>> hostfile<->nodename mapping would be for you to provide us
>>>>>> with the default-hostfile and/or hostfile cmd line options just
>>>>>> as if you were issuing the mpirun cmd. We just wouldn't launch -
>>>>>> but it would be the exact equivalent of doing
>>>>>> "mpirun --do-not-launch".
>>>>>>
>>>>>> Am I missing something? If so, please do correct me - I would
>>>>>> be happy to provide a tool if that would make it easier. Just
>>>>>> not sure what that tool would do.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> It seems a little strange to be using mpirun for this, but
>>>>>>> barring providing a separate command, or using ompi_info, I
>>>>>>> think this would solve our problem.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Greg
>>>>>>>
>>>>>>> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> Sorry for delay - had to ponder this one for a while.
>>>>>>>>
>>>>>>>> Jeff and I agree that adding something to ompi_info would not
>>>>>>>> be a good idea. Ompi_info has no knowledge or understanding
>>>>>>>> of hostfiles, and adding that capability to it would be a
>>>>>>>> major distortion of its intended use.
>>>>>>>>
>>>>>>>> However, we think we can offer an alternative that might
>>>>>>>> better solve the problem. Remember, we now treat hostfiles in
>>>>>>>> a very different manner than before - see the wiki page for a
>>>>>>>> complete description, or "man orte_hosts".
>>>>>>>>
>>>>>>>> So the problem is that, to provide you with what you want, we
>>>>>>>> need to "dump" the information from whatever default-hostfile
>>>>>>>> was provided, and, if no default-hostfile was provided, then
>>>>>>>> the information from each hostfile that was provided with an
>>>>>>>> app_context.
>>>>>>>>
>>>>>>>> The best way we could think of to do this is to add another
>>>>>>>> mpirun cmd line option --dump-hostfiles that would output the
>>>>>>>> line-by-line name from the hostfile plus the name we resolved
>>>>>>>> it to. Of course, --xml would cause it to be in xml format.
>>>>>>>>
>>>>>>>> Would that meet your needs?
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>>>>>>>
>>>>>>>>> Hi Ralph,
>>>>>>>>>
>>>>>>>>> We've been discussing this back and forth a bit internally
>>>>>>>>> and don't really see an easy solution. Our problem is that
>>>>>>>>> Eclipse is not running on the head node, so gethostbyname
>>>>>>>>> will not necessarily resolve to the same address. For
>>>>>>>>> example, the hostfile might refer to the head node by an
>>>>>>>>> internal network address that is not visible to the outside
>>>>>>>>> world. Since gethostbyname also looks in /etc/hosts, it may
>>>>>>>>> resolve locally but not on a remote system. The only thing I
>>>>>>>>> can think of would be, rather than us reading the hostfile
>>>>>>>>> directly as we do now, to provide an option to ompi_info
>>>>>>>>> that would dump the hostfile using the same rules that you
>>>>>>>>> apply when you're using the hostfile. Would that be feasible?
>>>>>>>>>
>>>>>>>>> Greg
>>>>>>>>>
>>>>>>>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for delay - was on vacation and am now trying to work
>>>>>>>>>> my way back to the surface.
>>>>>>>>>>
>>>>>>>>>> I'm not sure I can fix this one for two reasons:
>>>>>>>>>>
>>>>>>>>>> 1. In general, OMPI doesn't really care what name is used
>>>>>>>>>> for the node. However, the problem is that it needs to be
>>>>>>>>>> consistent. In this case, ORTE has already used the name
>>>>>>>>>> returned by gethostname to create its session directory
>>>>>>>>>> structure long before mpirun reads a hostfile. This is why
>>>>>>>>>> we retain the value from gethostname instead of allowing it
>>>>>>>>>> to be overwritten by the name in whatever allocation we are
>>>>>>>>>> given. Using the name in hostfile would require that I
>>>>>>>>>> either find some way to remember any prior name, or that I
>>>>>>>>>> tear down and rebuild the session directory tree - neither
>>>>>>>>>> seems attractive nor simple (e.g., what happens when the
>>>>>>>>>> user provides multiple entries in the hostfile for the
>>>>>>>>>> node, each with a different IP address based on another
>>>>>>>>>> interface in that node? Sounds crazy, but we have already
>>>>>>>>>> seen it done - which one do I use?).
>>>>>>>>>>
>>>>>>>>>> 2. We don't actually store the hostfile info anywhere - we
>>>>>>>>>> just use it and forget it. For us to add an XML attribute
>>>>>>>>>> containing any hostfile-related info would therefore
>>>>>>>>>> require us to re-read the hostfile. I could have it do that
>>>>>>>>>> -only- in the case of "XML output required", but it seems
>>>>>>>>>> rather ugly.
>>>>>>>>>>
>>>>>>>>>> An alternative might be for you to simply do a
>>>>>>>>>> "gethostbyname" lookup of the IP address or hostname to see
>>>>>>>>>> if it matches instead of just doing a strcmp. This is what
>>>>>>>>>> we have to do internally as we frequently have problems
>>>>>>>>>> with FQDN vs. non-FQDN vs. IP addresses etc. If the local
>>>>>>>>>> OS hasn't cached the IP address for the node in question it
>>>>>>>>>> can take a little time to DNS resolve it, but otherwise
>>>>>>>>>> works fine.
>>>>>>>>>>
>>>>>>>>>> I can point you to the code in OPAL that we use - I would
>>>>>>>>>> think something similar would be easy to implement in your
>>>>>>>>>> code and would readily solve the problem.
>>>>>>>>>>
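A rough sketch of the "resolve, then compare" idea described above, as opposed to a plain string comparison. This is not the OPAL code referred to in the message; the function names are hypothetical, and it uses Python's socket.getaddrinfo in place of gethostbyname. Two names are treated as the same node when they share at least one resolved address.

```python
import socket


def resolve_addresses(name):
    """Return the set of IP addresses that `name` resolves to locally."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()  # unresolvable names simply match nothing
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def same_node(a, b, resolve=resolve_addresses):
    """True if `a` and `b` name the same node.

    A case-insensitive string match is tried first; otherwise the two
    names (hostname, FQDN, or IP address) are resolved and compared by
    address. `resolve` is injectable so the lookup can be stubbed out.
    """
    if a.lower() == b.lower():
        return True
    addrs_a = resolve(a)
    return bool(addrs_a) and not addrs_a.isdisjoint(resolve(b))


if __name__ == "__main__":
    print(same_node("127.0.0.1", "localhost"))
```

Note that the result depends on the resolver of the machine doing the comparison, so a name that resolves on the head node may not resolve, or may resolve differently, elsewhere.
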
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>>>>>>>
>>>>>>>>>>> Ralph,
>>>>>>>>>>>
>>>>>>>>>>> The problem we're seeing is just with the head node. If I
>>>>>>>>>>> specify a particular IP address for the head node in the
>>>>>>>>>>> hostfile, it gets changed to the FQDN when displayed in
>>>>>>>>>>> the map. This is a problem for us as we need to be able to
>>>>>>>>>>> match the two, and since we're not necessarily running on
>>>>>>>>>>> the head node, we can't always do the same resolution
>>>>>>>>>>> you're doing.
>>>>>>>>>>>
>>>>>>>>>>> Would it be possible to use the same address that is
>>>>>>>>>>> specified in the hostfile, or alternatively provide an XML
>>>>>>>>>>> attribute that contains this information?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Greg
>>>>>>>>>>>
>>>>>>>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not in that regard, depending upon what you mean by
>>>>>>>>>>>> "recently". The only changes I am aware of wrt nodes
>>>>>>>>>>>> consisted of some changes to the order in which we use
>>>>>>>>>>>> the nodes when specified by hostfile or -host, and a
>>>>>>>>>>>> little #if protectionism needed by Brian for the Cray port.
>>>>>>>>>>>>
>>>>>>>>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>>>>>>>>> offhand think of anything in the code base that would
>>>>>>>>>>>> replace a host name with the FQDN because we don't get
>>>>>>>>>>>> that info for remote nodes. The only exception is the
>>>>>>>>>>>> head node (where mpirun sits) - in that lone case, we
>>>>>>>>>>>> default to the name returned to us by gethostname(). We
>>>>>>>>>>>> do that because the head node is frequently accessible on
>>>>>>>>>>>> a more global basis than the compute nodes - thus, the
>>>>>>>>>>>> FQDN is required to ensure that there is no address
>>>>>>>>>>>> confusion on the network.
>>>>>>>>>>>>
>>>>>>>>>>>> If the user refers to compute nodes in a hostfile or -host
>>>>>>>>>>>> (or in an allocation from a resource manager) by non-FQDN,
>>>>>>>>>>>> we just assume they know what they are doing and the name
>>>>>>>>>>>> will correctly resolve to a unique address.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has the behavior of the -display-map option changed
>>>>>>>>>>>>> recently in the 1.3 branch? We're now seeing the host name
>>>>>>>>>>>>> as a fully qualified domain name (FQDN) rather
>>>>>>>>>>>>> than the entry that was specified in the hostfile. Is
>>>>>>>>>>>>> there any particular reason for this? If so, would it be
>>>>>>>>>>>>> possible to add the hostfile entry to the output since
>>>>>>>>>>>>> we need to be able to match the two?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Greg
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel