Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -display-map
From: Greg Watson (g.watson_at_[hidden])
Date: 2008-10-22 13:55:02


Ralph,

I guess the issue for us is that we will have to run two commands to
get the information we need. One to get the configuration information,
such as version and MCA parameters, and one to get the host
information, whereas it would seem more logical that this should all
be available via some kind of "configuration discovery" command. I
understand the issue with supplying the hostfile though, so maybe this
just points at the need for us to separate configuration information
from the host information. In any case, we'll work with what you think
is best.

Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

> Hmmm...just to be sure we are all clear on this. The reason we
> proposed to use mpirun is that "hostfile" has no meaning outside of
> mpirun. That's why ompi_info can't do anything in this regard.
>
> We have no idea what hostfile the user may specify until we actually
> get the mpirun cmd line. They may have specified a default-hostfile,
> but they could also specify hostfiles for the individual
> app_contexts. These may or may not include the node upon which
> mpirun is executing.
>
> So the only way to provide you with a separate command to get a
> hostfile<->nodename mapping would require you to provide us with the
> default-hostfile and/or hostfile cmd line options just as if you
> were issuing the mpirun cmd. We just wouldn't launch - but it would
> be the exact equivalent of doing "mpirun --do-not-launch".
>
> Am I missing something? If so, please do correct me - I would be
> happy to provide a tool if that would make it easier. Just not sure
> what that tool would do.
>
> Thanks
> Ralph
>
>
> On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
>
>> Ralph,
>>
>> It seems a little strange to be using mpirun for this, but barring
>> providing a separate command, or using ompi_info, I think this
>> would solve our problem.
>>
>> Thanks,
>>
>> Greg
>>
>> On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
>>
>>> Sorry for delay - had to ponder this one for a while.
>>>
>>> Jeff and I agree that adding something to ompi_info would not be a
>>> good idea. Ompi_info has no knowledge or understanding of
>>> hostfiles, and adding that capability to it would be a major
>>> distortion of its intended use.
>>>
>>> However, we think we can offer an alternative that might better
>>> solve the problem. Remember, we now treat hostfiles in a very
>>> different manner than before - see the wiki page for a complete
>>> description, or "man orte_hosts".
>>>
>>> So the problem is that, to provide you with what you want, we need
>>> to "dump" the information from whatever default-hostfile was
>>> provided, and, if no default-hostfile was provided, then the
>>> information from each hostfile that was provided with an
>>> app_context.
>>>
>>> The best way we could think of to do this is to add another mpirun
>>> cmd line option --dump-hostfiles that would output the line-by-
>>> line name from the hostfile plus the name we resolved it to. Of
>>> course, --xml would cause it to be in xml format.
>>>
>>> Would that meet your needs?
>>>
>>> Ralph
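
To make the proposal above concrete, here is a rough sketch of the kind of entry-to-resolved-name mapping being discussed. It is not Open MPI code: the function names are hypothetical, the hostfile parsing is simplified to "first token on each non-comment line", and socket.getfqdn merely stands in for whatever resolution mpirun actually performs. The thread does not specify an output format, so none is assumed here.

```python
import socket


def hostfile_entries(text):
    """Return the host names listed in hostfile text, in order.

    Simplifying assumption: the host name is the first whitespace-separated
    token on each line, and anything after '#' is a comment.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        names.append(line.split()[0])
    return names


def map_entries(text, resolve=socket.getfqdn):
    """Pair each hostfile entry with the name it resolves to.

    `resolve` is a stand-in for the launcher's own resolution rules and
    can be replaced, for example in tests, by any callable.
    """
    return [(name, resolve(name)) for name in hostfile_entries(text)]
```

Because the result depends on the resolver of the machine where it runs, a mapping like this is only useful to a remote tool if it is produced on the node where mpirun executes.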
>>>
>>>
>>> On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> We've been discussing this back and forth a bit internally and
>>>> don't really see an easy solution. Our problem is that Eclipse is
>>>> not running on the head node, so gethostbyname will not
>>>> necessarily resolve to the same address. For example, the
>>>> hostfile might refer to the head node by an internal network
>>>> address that is not visible to the outside world. Since
>>>> gethostbyname also looks in /etc/hosts, it may resolve locally but
>>>> not on a remote system. The only thing I can think of would be,
>>>> rather than us reading the hostfile directly as we do now, to
>>>> provide an option to ompi_info that would dump the hostfile using
>>>> the same rules that you apply when you're using the hostfile.
>>>> Would that be feasible?
>>>>
>>>> Greg
>>>>
>>>> On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
>>>>
>>>>> Sorry for delay - was on vacation and am now trying to work my
>>>>> way back to the surface.
>>>>>
>>>>> I'm not sure I can fix this one for two reasons:
>>>>>
>>>>> 1. In general, OMPI doesn't really care what name is used for
>>>>> the node. However, the problem is that it needs to be
>>>>> consistent. In this case, ORTE has already used the name
>>>>> returned by gethostname to create its session directory
>>>>> structure long before mpirun reads a hostfile. This is why we
>>>>> retain the value from gethostname instead of allowing it to be
>>>>> overwritten by the name in whatever allocation we are given.
>>>>> Using the name in hostfile would require that I either find some
>>>>> way to remember any prior name, or that I tear down and rebuild
>>>>> the session directory tree - neither seems attractive nor simple
>>>>> (e.g., what happens when the user provides multiple entries in
>>>>> the hostfile for the node, each with a different IP address
>>>>> based on another interface in that node? Sounds crazy, but we
>>>>> have already seen it done - which one do I use?).
>>>>>
>>>>> 2. We don't actually store the hostfile info anywhere - we just
>>>>> use it and forget it. For us to add an XML attribute containing
>>>>> any hostfile-related info would therefore require us to re-read
>>>>> the hostfile. I could have it do that -only- in the case of "XML
>>>>> output required", but it seems rather ugly.
>>>>>
>>>>> An alternative might be for you to simply do a "gethostbyname"
>>>>> lookup of the IP address or hostname to see if it matches
>>>>> instead of just doing a strcmp. This is what we have to do
>>>>> internally as we frequently have problems with FQDN vs. non-FQDN
>>>>> vs. IP addresses etc. If the local OS hasn't cached the IP
>>>>> address for the node in question it can take a little time to
>>>>> DNS resolve it, but otherwise works fine.
>>>>>
>>>>> I can point you to the code in OPAL that we use - I would think
>>>>> something similar would be easy to implement in your code and
>>>>> would readily solve the problem.
>>>>>
>>>>> Ralph
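
The lookup-instead-of-strcmp idea above can be sketched roughly as follows. This is not the OPAL code mentioned in the message; the function names are hypothetical, and socket.getaddrinfo is used in place of gethostbyname. Two names are treated as the same host if they are textually equal or if their resolved address sets overlap.

```python
import socket


def resolve_addresses(name):
    """Return the set of IP address strings that `name` resolves to.

    An unresolvable name yields an empty set instead of raising.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def same_host(a, b, resolve=resolve_addresses):
    """Return True if `a` and `b` appear to name the same host.

    Falls back to address comparison only when the strings differ, so
    identical names never trigger a lookup.
    """
    if a == b:
        return True
    return bool(resolve(a) & resolve(b))
```

The resolver is a parameter so the comparison can be exercised without DNS. Note that the answer is only as good as the resolver of the machine doing the lookup, which is exactly the concern raised about a client that is not running on the head node.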
>>>>>
>>>>> On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> The problem we're seeing is just with the head node. If I
>>>>>> specify a particular IP address for the head node in the
>>>>>> hostfile, it gets changed to the FQDN when displayed in the
>>>>>> map. This is a problem for us as we need to be able to match
>>>>>> the two, and since we're not necessarily running on the head
>>>>>> node, we can't always do the same resolution you're doing.
>>>>>>
>>>>>> Would it be possible to use the same address that is specified
>>>>>> in the hostfile, or alternatively provide an XML attribute that
>>>>>> contains this information?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>> On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Not in that regard, depending upon what you mean by
>>>>>>> "recently". The only changes I am aware of wrt nodes consisted
>>>>>>> of some changes to the order in which we use the nodes when
>>>>>>> specified by hostfile or -host, and a little #if protectionism
>>>>>>> needed by Brian for the Cray port.
>>>>>>>
>>>>>>> Are you seeing this for every node? Reason I ask: I can't
>>>>>>> offhand think of anything in the code base that would replace
>>>>>>> a host name with the FQDN because we don't get that info for
>>>>>>> remote nodes. The only exception is the head node (where
>>>>>>> mpirun sits) - in that lone case, we default to the name
>>>>>>> returned to us by gethostname(). We do that because the head
>>>>>>> node is frequently accessible on a more global basis than the
>>>>>>> compute nodes - thus, the FQDN is required to ensure that
>>>>>>> there is no address confusion on the network.
>>>>>>>
>>>>>>> If the user refers to compute nodes in a hostfile or -host (or
>>>>>>> in an allocation from a resource manager) by non-FQDN, we just
>>>>>>> assume they know what they are doing and the name will
>>>>>>> correctly resolve to a unique address.
>>>>>>>
>>>>>>>
>>>>>>> On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Has the behavior of the -display-map option changed
>>>>>>>> recently in the 1.3 branch? We're now
>>>>>>>> seeing the host name as a fully resolved DN rather than the
>>>>>>>> entry that was specified in the hostfile. Is there any
>>>>>>>> particular reason for this? If so, would it be possible to
>>>>>>>> add the hostfile entry to the output since we need to be able
>>>>>>>> to match the two?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Greg
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel