Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-16 13:08:08



Er... whoops. This looks like my mistake (I just recently added
MPI_REDUCE_LOCAL to the trunk -- not v1.3).

I could have sworn that I tested this on a Mac, multiple times. I'll
test again...

On Jan 16, 2009, at 12:58 PM, Greg Watson wrote:

> When I try to build trunk, it fails with:
>
> i_f77.lax/libmpi_f77_pmpi.a/pwin_unlock_f.o .libs/libmpi_f77.lax/
> libmpi_f77_pmpi.a/pwin_wait_f.o .libs/libmpi_f77.lax/
> libmpi_f77_pmpi.a/pwtick_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/
> pwtime_f.o ../../../ompi/.libs/libmpi.0.0.0.dylib /usr/local/
> openmpi-1.4-devel/lib/libopen-rte.0.0.0.dylib /usr/local/openmpi-1.4-
> devel/lib/libopen-pal.0.0.0.dylib -install_name /usr/local/
> openmpi-1.4-devel/lib/libmpi_f77.0.dylib -compatibility_version 1 -
> current_version 1.0
> ld: duplicate symbol _mpi_reduce_local_f in .libs/libmpi_f77.lax/
> libmpi_f77_pmpi.a/preduce_local_f.o and .libs/reduce_local_f.o
>
> collect2: ld returned 1 exit status
> make[3]: *** [libmpi_f77.la] Error 1
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> I'm using the default configure command (./configure --prefix=xxx)
> on Mac OS X 10.5. This works fine on the 1.3 branch.
>
> Greg
>
> On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:
>
>> Okay, it is in the trunk as of r20284 - I'll file the request to
>> have it moved to 1.3.1.
>>
>> Let me know if you get a chance to test the stdout/err stuff in the
>> trunk - we should try and iterate it so any changes can make 1.3.1
>> as well.
>>
>> Thanks!
>> Ralph
>>
>>
>> On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:
>>
>>> Ralph,
>>>
>>> I think the second form would be ideal and would simplify things
>>> greatly.
>>>
>>> Greg
>>>
>>> On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:
>>>
>>>> Here is what I was able to do - note that the resolve messages
>>>> are associated with the specific hostname, not the overall map:
>>>>
>>>> <map>
>>>> <host name="graywolf54.lanl.gov" slots="1" max_slots="0">
>>>> <noderesolve name="graywolf54.lanl.gov" resolved="localhost"/>
>>>> <process rank="0"/>
>>>> <process rank="1"/>
>>>> <process rank="2"/>
>>>> </host>
>>>> </map>
>>>>
>>>> Will that work for you? If you like, I can remove the name= field
>>>> from the noderesolve element since the info is specific to the
>>>> host element that contains it. In other words, I can make it look
>>>> like this:
>>>>
>>>> <map>
>>>> <host name="graywolf54.lanl.gov" slots="1" max_slots="0">
>>>> <noderesolve resolved="localhost"/>
>>>> <process rank="0"/>
>>>> <process rank="1"/>
>>>> <process rank="2"/>
>>>> </host>
>>>> </map>
>>>>
>>>> if that would help.
>>>>
>>>> Ralph
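
A minimal sketch of how a consumer such as a front-end tool might read the second form of the map shown above. It assumes the output has been captured as a single string; the function name parse_map and the shape of the returned dictionary are made up for illustration and are not part of Open MPI.

```python
import xml.etree.ElementTree as ET

# Sample copied from the second form of the map in the message above.
MAP_XML = """
<map>
<host name="graywolf54.lanl.gov" slots="1" max_slots="0">
<noderesolve resolved="localhost"/>
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
</host>
</map>
"""


def parse_map(text):
    """Return {host name: {"resolved": [...], "ranks": [...]}} (hypothetical layout)."""
    hosts = {}
    root = ET.fromstring(text.strip())
    for host in root.findall("host"):
        hosts[host.get("name")] = {
            # noderesolve elements are nested in the host they belong to
            "resolved": [n.get("resolved") for n in host.findall("noderesolve")],
            "ranks": [int(p.get("rank")) for p in host.findall("process")],
        }
    return hosts


if __name__ == "__main__":
    print(parse_map(MAP_XML))
```

Because each noderesolve element sits inside its host element, the name= attribute is not needed to associate the two, which is the point of the second form.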
>>>>
>>>>
>>>> On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>>
>>>>> We -may- be able to do a more formal XML output at some point.
>>>>> The problem will be the natural interleaving of stdout/err from
>>>>> the various procs due to the async behavior of MPI. Mpirun
>>>>> receives fragmented output in the forwarding system, limited by
>>>>> the buffer sizes and the amount of data we can read at any one
>>>>> "bite" from the pipes connecting us to the procs. So even though
>>>>> the user -thinks- they output a single large line of stuff, it
>>>>> may show up at mpirun as a series of fragments. Hence, it gets
>>>>> tricky to know how to put appropriate XML brackets around it.
>>>>>
>>>>> Given this input about when you actually want resolved name
>>>>> info, I can at least do something about that area. Won't be in
>>>>> 1.3.0, but should make 1.3.1.
>>>>>
>>>>> As for XML-tagged stdout/err: the OMPI community asked me not to
>>>>> turn that feature "on" for 1.3.0 as they felt it hasn't been
>>>>> adequately tested yet. The code is present, but cannot be
>>>>> activated in 1.3.0. However, I believe it is activated on the
>>>>> trunk when you do --xml --tagged-output, so perhaps some testing
>>>>> will help us debug and validate it adequately for 1.3.1?
>>>>>
>>>>> Thanks
>>>>> Ralph
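
To illustrate the fragmentation problem described above, here is a rough sketch of one way to buffer fragments per rank and only emit a tagged element once a full line has arrived. This is not how mpirun's forwarding code works; the class name LineAssembler and the stdout tag with a rank attribute are assumptions chosen for the example.

```python
from xml.sax.saxutils import escape


class LineAssembler:
    """Collect output fragments per rank and release only complete lines."""

    def __init__(self):
        self._pending = {}  # rank -> text received so far without a newline

    def feed(self, rank, fragment):
        """Add one fragment; return the list of newly completed tagged lines."""
        data = self._pending.get(rank, "") + fragment
        # Everything before the last newline is complete; the tail is kept.
        *complete, rest = data.split("\n")
        self._pending[rank] = rest
        return [
            '<stdout rank="%d">%s</stdout>' % (rank, escape(line))
            for line in complete
        ]


if __name__ == "__main__":
    asm = LineAssembler()
    for rank, frag in [(0, "hel"), (1, "x<y\n"), (0, "lo\nwor"), (0, "ld\n")]:
        for tagged in asm.feed(rank, frag):
            print(tagged)
```

The trade-off is that output without a trailing newline stays buffered until more data arrives, so a real implementation would also need a flush when the process exits.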
>>>>>
>>>>>
>>>>> On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> The only time we use the resolved names is when we get a map,
>>>>>> so we consider them part of the map output.
>>>>>>
>>>>>> If quasi-XML is all that will ever be possible with 1.3, then
>>>>>> you may as well leave as-is and we will attempt to clean it up
>>>>>> in Eclipse. It would be nice if a future version of ompi could
>>>>>> output correct XML (including stdout) as this would vastly
>>>>>> simplify the parsing we need to do.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>> On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Hmmm...well, I can't do either for 1.3.0 as it is departing
>>>>>>> this afternoon.
>>>>>>>
>>>>>>> The first option would be very hard to do. I would have to
>>>>>>> expose the display-map option across the code base and check
>>>>>>> it prior to printing anything about resolving node names. I
>>>>>>> guess I should ask: do you only want noderesolve statements
>>>>>>> when we are displaying the map? Right now, I will output them
>>>>>>> regardless.
>>>>>>>
>>>>>>> The second option could be done. I could check if any
>>>>>>> "display" option has been specified, and output the <ompi>
>>>>>>> root at that time (likewise for the end). Anything we output
>>>>>>> in-between would be encapsulated between the two, but that
>>>>>>> would include any user output to stdout and/or stderr - which
>>>>>>> for 1.3.0 is not in xml.
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>> PS. Guess I should clarify that I was not striving for true
>>>>>>> XML interaction here, but rather a quasi-XML format that would
>>>>>>> help you to filter the output. I have no problem trying to get
>>>>>>> to something more formally correct, but it could be tricky in
>>>>>>> some places to achieve it due to the inherent async nature of
>>>>>>> the beast.
>>>>>>>
>>>>>>>
>>>>>>> On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:
>>>>>>>
>>>>>>>> Ralph,
>>>>>>>>
>>>>>>>> The XML is looking better now, but there is still one
>>>>>>>> problem. To be valid, there needs to be only one root
>>>>>>>> element, but currently you don't have any (or many). So
>>>>>>>> rather than:
>>>>>>>>
>>>>>>>> <noderesolve name="node0" resolved="Jarrah.local"/>
>>>>>>>> <noderesolve name="node1" resolved="Jarrah.local"/>
>>>>>>>> <map>
>>>>>>>> <host name="Jarrah.local" slots="8" max_slots="0">
>>>>>>>> <process rank="0"/>
>>>>>>>> <process rank="1"/>
>>>>>>>> <process rank="2"/>
>>>>>>>> <process rank="3"/>
>>>>>>>> <process rank="4"/>
>>>>>>>> </host>
>>>>>>>> </map>
>>>>>>>>
>>>>>>>> the XML should be:
>>>>>>>>
>>>>>>>> <map>
>>>>>>>> <noderesolve name="node0" resolved="Jarrah.local"/>
>>>>>>>> <noderesolve name="node1" resolved="Jarrah.local"/>
>>>>>>>> <host name="Jarrah.local" slots="8" max_slots="0">
>>>>>>>> <process rank="0"/>
>>>>>>>> <process rank="1"/>
>>>>>>>> <process rank="2"/>
>>>>>>>> <process rank="3"/>
>>>>>>>> <process rank="4"/>
>>>>>>>> </host>
>>>>>>>> </map>
>>>>>>>>
>>>>>>>> or:
>>>>>>>>
>>>>>>>> <ompi>
>>>>>>>> <noderesolve name="node0" resolved="Jarrah.local"/>
>>>>>>>> <noderesolve name="node1" resolved="Jarrah.local"/>
>>>>>>>> <map>
>>>>>>>> <host name="Jarrah.local" slots="8" max_slots="0">
>>>>>>>> <process rank="0"/>
>>>>>>>> <process rank="1"/>
>>>>>>>> <process rank="2"/>
>>>>>>>> <process rank="3"/>
>>>>>>>> <process rank="4"/>
>>>>>>>> </host>
>>>>>>>> </map>
>>>>>>>> </ompi>
>>>>>>>>
>>>>>>>> Would either of these be possible?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Greg
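
The single-root requirement above can be checked with any XML parser. The sketch below uses Python's standard library; the sample is shortened from the first listing in the message, and the helper names is_well_formed and parse_fragments are hypothetical. Wrapping the fragments in a synthetic ompi root on the consumer side is only a workaround, assuming the fragments themselves are well formed.

```python
import xml.etree.ElementTree as ET

# Shortened from the first listing above: several top-level elements, no root.
QUASI_XML = """\
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<map>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
</host>
</map>
"""


def is_well_formed(text):
    """True if text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parse_fragments(text, root_tag="ompi"):
    """Wrap a run of sibling elements in one root element and parse it."""
    return ET.fromstring("<%s>%s</%s>" % (root_tag, text, root_tag))


if __name__ == "__main__":
    print(is_well_formed(QUASI_XML))
    print([child.tag for child in parse_fragments(QUASI_XML)])
```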
>>>>>>>>
>>>>>>>> On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:
>>>>>>>>
>>>>>>>>> Ok thanks. I'll test from trunk in future.
>>>>>>>>>
>>>>>>>>> Greg
>>>>>>>>>
>>>>>>>>> On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> Working its way around the CMR process now.
>>>>>>>>>>
>>>>>>>>>> Might be easier in the future if we could test/debug this
>>>>>>>>>> in the trunk, though. Otherwise, the CMR procedure will
>>>>>>>>>> fall behind and a fix might miss a release window.
>>>>>>>>>>
>>>>>>>>>> Anyway, hopefully this one will make the 1.3.0 release
>>>>>>>>>> cutoff.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>> On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>
>>>>>>>>>>> This is now in 1.3rc2, thanks. However there are a couple
>>>>>>>>>>> of problems. Here is what I see:
>>>>>>>>>>>
>>>>>>>>>>> [Jarrah.watson.ibm.com:58957] <noderesolve name="node0"
>>>>>>>>>>> resolved="Jarrah.watson.ibm.com">
>>>>>>>>>>>
>>>>>>>>>>> For some reason each line is prefixed with "[...]", any
>>>>>>>>>>> idea why this is? Also the end tag should be "/>" not ">".
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Greg
>>>>>>>>>>>
>>>>>>>>>>> On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Great, thanks. I'll take a look once it comes over to 1.3.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Greg
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yo Greg
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is in the trunk as of r20032. I'll bring it over to
>>>>>>>>>>>>> 1.3 in a few days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I implemented it as another MCA param
>>>>>>>>>>>>> "orte_show_resolved_nodenames" so you can actually get
>>>>>>>>>>>>> the info as you execute the job, if you want. The xml
>>>>>>>>>>>>> tag is "noderesolve" - let me know if you need any
>>>>>>>>>>>>> changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess the issue for us is that we will have to run
>>>>>>>>>>>>>> two commands to get the information we need. One to get
>>>>>>>>>>>>>> the configuration information, such as version and MCA
>>>>>>>>>>>>>> parameters, and one to get the host information,
>>>>>>>>>>>>>> whereas it would seem more logical that this should all
>>>>>>>>>>>>>> be available via some kind of "configuration discovery"
>>>>>>>>>>>>>> command. I understand the issue with supplying the
>>>>>>>>>>>>>> hostfile though, so maybe this just points at the need
>>>>>>>>>>>>>> for us to separate configuration information from the
>>>>>>>>>>>>>> host information. In any case, we'll work with what you
>>>>>>>>>>>>>> think is best.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greg
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmmm...just to be sure we are all clear on this. The
>>>>>>>>>>>>>>> reason we proposed to use mpirun is that "hostfile"
>>>>>>>>>>>>>>> has no meaning outside of mpirun. That's why ompi_info
>>>>>>>>>>>>>>> can't do anything in this regard.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have no idea what hostfile the user may specify
>>>>>>>>>>>>>>> until we actually get the mpirun cmd line. They may
>>>>>>>>>>>>>>> have specified a default-hostfile, but they could also
>>>>>>>>>>>>>>> specify hostfiles for the individual app_contexts.
>>>>>>>>>>>>>>> These may or may not include the node upon which
>>>>>>>>>>>>>>> mpirun is executing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the only way to provide you with a separate command
>>>>>>>>>>>>>>> to get a hostfile<->nodename mapping would require you
>>>>>>>>>>>>>>> to provide us with the default-hostfile and/or
>>>>>>>>>>>>>>> hostfile cmd line options just as if you were issuing
>>>>>>>>>>>>>>> the mpirun cmd. We just wouldn't launch - but it would
>>>>>>>>>>>>>>> be the exact equivalent of doing "mpirun --do-not-launch".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am I missing something? If so, please do correct me -
>>>>>>>>>>>>>>> I would be happy to provide a tool if that would make
>>>>>>>>>>>>>>> it easier. Just not sure what that tool would do.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It seems a little strange to be using mpirun for
>>>>>>>>>>>>>>>> this, but barring providing a separate command, or
>>>>>>>>>>>>>>>> using ompi_info, I think this would solve our problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Greg
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for delay - had to ponder this one for a while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jeff and I agree that adding something to ompi_info
>>>>>>>>>>>>>>>>> would not be a good idea. Ompi_info has no knowledge
>>>>>>>>>>>>>>>>> or understanding of hostfiles, and adding that
>>>>>>>>>>>>>>>>> capability to it would be a major distortion of its
>>>>>>>>>>>>>>>>> intended use.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, we think we can offer an alternative that
>>>>>>>>>>>>>>>>> might better solve the problem. Remember, we now
>>>>>>>>>>>>>>>>> treat hostfiles in a very different manner than
>>>>>>>>>>>>>>>>> before - see the wiki page for a complete
>>>>>>>>>>>>>>>>> description, or "man orte_hosts".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the problem is that, to provide you with what you
>>>>>>>>>>>>>>>>> want, we need to "dump" the information from
>>>>>>>>>>>>>>>>> whatever default-hostfile was provided, and, if no
>>>>>>>>>>>>>>>>> default-hostfile was provided, then the information
>>>>>>>>>>>>>>>>> from each hostfile that was provided with an
>>>>>>>>>>>>>>>>> app_context.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The best way we could think of to do this is to add
>>>>>>>>>>>>>>>>> another mpirun cmd line option --dump-hostfiles that
>>>>>>>>>>>>>>>>> would output the line-by-line name from the hostfile
>>>>>>>>>>>>>>>>> plus the name we resolved it to. Of course, --xml
>>>>>>>>>>>>>>>>> would cause it to be in xml format.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Would that meet your needs?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We've been discussing this back and forth a bit
>>>>>>>>>>>>>>>>>> internally and don't really see an easy solution.
>>>>>>>>>>>>>>>>>> Our problem is that Eclipse is not running on the
>>>>>>>>>>>>>>>>>> head node, so gethostbyname will not necessarily
>>>>>>>>>>>>>>>>>> resolve to the same address. For example, the
>>>>>>>>>>>>>>>>>> hostfile might refer to the head node by an
>>>>>>>>>>>>>>>>>> internal network address that is not visible to the
>>>>>>>>>>>>>>>>>> outside world. Since gethostbyname also looks in
>>>>>>>>>>>>>>>>>> /etc/hosts, it may resolve locally but not on a remote
>>>>>>>>>>>>>>>>>> system. The only thing I can think of would be,
>>>>>>>>>>>>>>>>>> rather than us reading the hostfile directly as we
>>>>>>>>>>>>>>>>>> do now, to provide an option to ompi_info that
>>>>>>>>>>>>>>>>>> would dump the hostfile using the same rules that
>>>>>>>>>>>>>>>>>> you apply when you're using the hostfile. Would
>>>>>>>>>>>>>>>>>> that be feasible?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Greg
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sorry for delay - was on vacation and am now
>>>>>>>>>>>>>>>>>>> trying to work my way back to the surface.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm not sure I can fix this one for two reasons:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. In general, OMPI doesn't really care what name
>>>>>>>>>>>>>>>>>>> is used for the node. However, the problem is that
>>>>>>>>>>>>>>>>>>> it needs to be consistent. In this case, ORTE has
>>>>>>>>>>>>>>>>>>> already used the name returned by gethostname to
>>>>>>>>>>>>>>>>>>> create its session directory structure long before
>>>>>>>>>>>>>>>>>>> mpirun reads a hostfile. This is why we retain the
>>>>>>>>>>>>>>>>>>> value from gethostname instead of allowing it to
>>>>>>>>>>>>>>>>>>> be overwritten by the name in whatever allocation
>>>>>>>>>>>>>>>>>>> we are given. Using the name in hostfile would
>>>>>>>>>>>>>>>>>>> require that I either find some way to remember
>>>>>>>>>>>>>>>>>>> any prior name, or that I tear down and rebuild
>>>>>>>>>>>>>>>>>>> the session directory tree - neither seems
>>>>>>>>>>>>>>>>>>> attractive nor simple (e.g., what happens when the
>>>>>>>>>>>>>>>>>>> user provides multiple entries in the hostfile for
>>>>>>>>>>>>>>>>>>> the node, each with a different IP address based
>>>>>>>>>>>>>>>>>>> on another interface in that node? Sounds crazy,
>>>>>>>>>>>>>>>>>>> but we have already seen it done - which one do I
>>>>>>>>>>>>>>>>>>> use?).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. We don't actually store the hostfile info
>>>>>>>>>>>>>>>>>>> anywhere - we just use it and forget it. For us to
>>>>>>>>>>>>>>>>>>> add an XML attribute containing any
>>>>>>>>>>>>>>>>>>> hostfile-related info would therefore require us to re-read
>>>>>>>>>>>>>>>>>>> the hostfile. I could have it do that -only- in
>>>>>>>>>>>>>>>>>>> the case of "XML output required", but it seems
>>>>>>>>>>>>>>>>>>> rather ugly.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> An alternative might be for you to simply do a
>>>>>>>>>>>>>>>>>>> "gethostbyname" lookup of the IP address or
>>>>>>>>>>>>>>>>>>> hostname to see if it matches instead of just
>>>>>>>>>>>>>>>>>>> doing a strcmp. This is what we have to do
>>>>>>>>>>>>>>>>>>> internally as we frequently have problems with
>>>>>>>>>>>>>>>>>>> FQDN vs. non-FQDN vs. IP addresses etc. If the
>>>>>>>>>>>>>>>>>>> local OS hasn't cached the IP address for the node
>>>>>>>>>>>>>>>>>>> in question it can take a little time to DNS
>>>>>>>>>>>>>>>>>>> resolve it, but otherwise works fine.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I can point you to the code in OPAL that we use -
>>>>>>>>>>>>>>>>>>> I would think something similar would be easy to
>>>>>>>>>>>>>>>>>>> implement in your code and would readily solve the
>>>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ralph
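
A rough sketch of the suggestion above: compare two node names by what they resolve to rather than by string equality. This is not the OPAL code referred to in the message; the function names resolve and same_host are made up for illustration, and the resolver is passed in as a parameter so the comparison can be tried without touching DNS.

```python
import socket


def resolve(name):
    """Return the set of IP addresses that name resolves to on this machine."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()  # unresolvable names match nothing
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def same_host(a, b, resolver=resolve):
    """True if a and b are equal strings or share at least one resolved address."""
    if a == b:
        return True
    return bool(resolver(a) & resolver(b))


if __name__ == "__main__":
    # "localhost" and "127.0.0.1" usually, but not always, resolve to the same address.
    print(same_host("localhost", "127.0.0.1"))
```

The answer depends on the resolver configuration of the machine that runs the check (DNS, /etc/hosts), so two machines can disagree about whether the same pair of names matches.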
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The problem we're seeing is just with the head
>>>>>>>>>>>>>>>>>>>> node. If I specify a particular IP address for
>>>>>>>>>>>>>>>>>>>> the head node in the hostfile, it gets changed to
>>>>>>>>>>>>>>>>>>>> the FQDN when displayed in the map. This is a
>>>>>>>>>>>>>>>>>>>> problem for us as we need to be able to match the
>>>>>>>>>>>>>>>>>>>> two, and since we're not necessarily running on
>>>>>>>>>>>>>>>>>>>> the head node, we can't always do the same
>>>>>>>>>>>>>>>>>>>> resolution you're doing.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Would it be possible to use the same address that
>>>>>>>>>>>>>>>>>>>> is specified in the hostfile, or alternatively
>>>>>>>>>>>>>>>>>>>> provide an XML attribute that contains this
>>>>>>>>>>>>>>>>>>>> information?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Greg
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Not in that regard, depending upon what you mean
>>>>>>>>>>>>>>>>>>>>> by "recently". The only changes I am aware of
>>>>>>>>>>>>>>>>>>>>> wrt nodes consisted of some changes to the order
>>>>>>>>>>>>>>>>>>>>> in which we use the nodes when specified by
>>>>>>>>>>>>>>>>>>>>> hostfile or -host, and a little #if
>>>>>>>>>>>>>>>>>>>>> protectionism needed by Brian for the Cray port.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Are you seeing this for every node? Reason I
>>>>>>>>>>>>>>>>>>>>> ask: I can't offhand think of anything in the
>>>>>>>>>>>>>>>>>>>>> code base that would replace a host name with
>>>>>>>>>>>>>>>>>>>>> the FQDN because we don't get that info for
>>>>>>>>>>>>>>>>>>>>> remote nodes. The only exception is the head
>>>>>>>>>>>>>>>>>>>>> node (where mpirun sits) - in that lone case, we
>>>>>>>>>>>>>>>>>>>>> default to the name returned to us by
>>>>>>>>>>>>>>>>>>>>> gethostname(). We do that because the head node
>>>>>>>>>>>>>>>>>>>>> is frequently accessible on a more global basis
>>>>>>>>>>>>>>>>>>>>> than the compute nodes - thus, the FQDN is
>>>>>>>>>>>>>>>>>>>>> required to ensure that there is no address
>>>>>>>>>>>>>>>>>>>>> confusion on the network.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If the user refers to compute nodes in a
>>>>>>>>>>>>>>>>>>>>> hostfile or -host (or in an allocation from a
>>>>>>>>>>>>>>>>>>>>> resource manager) by non-FQDN, we just assume
>>>>>>>>>>>>>>>>>>>>> they know what they are doing and the name will
>>>>>>>>>>>>>>>>>>>>> correctly resolve to a unique address.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Has the behavior of the -display-map option changed
>>>>>>>>>>>>>>>>>>>>>> recently in the 1.3 branch? We're now seeing the
>>>>>>>>>>>>>>>>>>>>>> host name as a
>>>>>>>>>>>>>>>>>>>>>> fully resolved DN rather than the entry that
>>>>>>>>>>>>>>>>>>>>>> was specified in the hostfile. Is there any
>>>>>>>>>>>>>>>>>>>>>> particular reason for this? If so, would it be
>>>>>>>>>>>>>>>>>>>>>> possible to add the hostfile entry to the
>>>>>>>>>>>>>>>>>>>>>> output since we need to be able to match the two?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Greg
>>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems