I need to correct myself on something here...see below.
On 7/20/07 8:31 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
> On 7/20/07 8:13 AM, "Rolf.Vandevaart_at_[hidden]" <Rolf.Vandevaart_at_[hidden]> wrote:
>> Ralph brings up some good points here. I have a few thoughts/experiences.
>> First, I like the way things are behaving now. In fact, I take full
>> advantage of the fact that the different aliases for a node are treated as
>> different nodes to do some scalability testing. This is how I fake out
>> ORTE and have it start multiple daemons on a node. (We had a similar
>> feature in our old ClusterTools runtime environment to get multiple
>> daemons running on a single node.)
>> For example, I do this to get 4 orteds running on "alachua".
>> mpirun -np 4 -host alachua,alachua-1,alachua-2,alachua-3 hostname
>> All of the above resolve to the same IP address.
>> Secondly, I would not want us to make any change that negatively affects
>> scalability. If we do decide to make a change, then we need a flag to
>> revert to the original behaviour.
>> Lastly, I guess I have two questions.
>> 1. Are you sure that Open MPI behaves in "unexpected ways?" This all
>> worked fine for me as I stated above.
> Here's the problem. The system sends its launch message to every daemon.
> That message specifies what exec to run on each *node* - not what exec each
> daemon should run since we only map to nodes and expect only one daemon to
> be on that node. So if you have multiple daemons sitting on the same node,
> they each will launch a copy of the procs for that node.
> In your case, hostname couldn't care less so it isn't a problem. However,
> for an MPI job, you would now have multiple procs sharing the same name -
> which causes havoc.
Actually, in this case, it won't matter either way. The reason is that ORTE
actually thinks these are different nodes, so the launch message will get
interpreted correctly by the different daemons - so you'll only get the
proper number of procs launched.
You'll still have an issue with oversubscription, but the system should
otherwise work okay. So, other than that one caveat, I have to retract my
comment about "behaving in unexpected ways" - this only happened when we
were mistakenly launching multiple daemons on the same node and we actually
thought they were the same node (during devel - don't think this is
happening on the trunk).
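To make the distinction concrete, here is a toy model in Python. It is not ORTE code, and the names (launch, launch_msg, daemon_nodes) are made up for illustration. It assumes the launch message is a map from node name to the ranks placed on that node, and that each daemon starts whatever is listed under the node name it identifies itself with.

```python
from collections import Counter

def launch(launch_msg, daemon_nodes):
    """Return how many copies of each rank get started.

    launch_msg:   dict mapping node name -> list of ranks mapped to that node
    daemon_nodes: one entry per daemon, giving the node name that daemon
                  identifies itself with (hypothetical model, not ORTE's API)
    """
    started = Counter()
    for node in daemon_nodes:
        # Each daemon launches everything mapped to "its" node name.
        for rank in launch_msg.get(node, []):
            started[rank] += 1
    return started

# Aliases seen as four distinct nodes: one rank per "node", one daemon each.
aliases = ["alachua", "alachua-1", "alachua-2", "alachua-3"]
msg = {name: [rank] for rank, name in enumerate(aliases)}
ok = launch(msg, aliases)

# The devel-time case: two daemons that both believe they are "alachua".
dup = launch({"alachua": [0, 1]}, ["alachua", "alachua"])

print(dict(ok))   # every rank started once
print(dict(dup))  # every rank started twice
```

In the first case the only cost is that all four daemons and their procs share one physical machine (the oversubscription caveat). The second case is the one that produces multiple procs with the same name.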
Sorry for the confusion...
>> 2. Do you have any more details on the cost of "resolving every name"?
>> Which API is it that causes the problems? I only ask because I have
>> been trying to understand some of the NIS traffic I see when running
>> on my cluster.
> I honestly don't recall details at the moment - it was a couple of years ago
> when we last tried that option. If I recall correctly (Jeff's infamous
> IIRC), it involved doing a dns_lookup on every hostname, which meant that
> the HNP was banging away on your local dns server. This would take a few
> seconds for a few tens of nodes due to traffic contention at the dns server
> on some of the clusters we were using at the time, so there was concern over
> But I may be mis-remembering. If someone can/wants to run a quick test code
> to measure the time required, that might be useful info. My guess, though,
> is that this might have scaling issues. Again, we could only require this in
> specific cases - maybe when we have -host specified? Just fishing here...
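For anyone who wants to try the quick timing test, a minimal sketch in Python follows. It assumes the cost of interest is one serial lookup per hostname from a single machine, which may or may not match what the HNP actually did back then. The function name time_resolution and the resolve parameter are illustrative only; socket.gethostbyname is the standard library call.

```python
import socket
import sys
import time

def time_resolution(hostnames, resolve=socket.gethostbyname):
    """Resolve each name once, serially.

    Returns (total_seconds, {name: address or None}).
    """
    results = {}
    start = time.perf_counter()
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are OSError subclasses.
            # Unresolvable names are recorded rather than aborting the run.
            results[name] = None
    return time.perf_counter() - start, results

if __name__ == "__main__":
    # Hostnames come from the command line; default to localhost.
    names = sys.argv[1:] or ["localhost"]
    elapsed, addrs = time_resolution(names)
    for name, addr in addrs.items():
        print(f"{name} -> {addr}")
    print(f"{len(names)} names in {elapsed:.3f} s")
```

Run it with the node names from a hostfile as arguments. Local caching (nscd, /etc/hosts entries) can hide the real cost, so the first run on a cold cache is probably the number that matters.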
>> Ralph Castain wrote:
>>> Yo all
>>> A recent email thread on the devel list involved (in part) the question of
>>> hostname resolution. [Note: I have a fix for the localhost problem described
>>> in that thread - just need to chase down a memory corruption problem, so it
>>> won't come into the trunk until next week]
>>> This is a problem that has troubled us since the beginning, and we have gone
>>> back-and-forth on solutions. Rather than just throwing another code change
>>> into the system, Jeff and I thought it might be a good idea to seek input
>>> from the community.
>>> The problem is that our system requires a consistent way of identifying
>>> nodes so we can tell if, for example, we already have a daemon on that node.
>>> We currently do that via a string hostname. This appears to work just fine
>>> in managed environments as the allocators are (usually?) consistent in how
>>> they name a node.
>>> However, users are frequently not consistent, which causes a problem. For
>>> example, users can create a hostfile entry for "foo.bar.net", and then put
>>> "-host foo" on their command line. In Open MPI, these will be treated as two
>>> completely separate nodes.
>>> In the past, we attempted to solve this by actually resolving every name
>>> provided to us. However, resolving names of remote hosts can be a very
>>> expensive function call, especially at scale. One solution we considered was
>>> to only do this for non-managed environments - i.e., when provided names in
>>> a hostfile or via -host. This was rejected on the grounds that it penalized
>>> people who used those mechanisms and, in many cases, wasn't necessary
>>> because users were careful to avoid ambiguity.
>>> But that leaves us with an unsolved problem that can cause Open MPI to
>>> behave in unexpected ways, including possibly hanging. Of course, we could
>>> just check names for matches in that first network name field - this would
>>> solve the "foo" vs "foo.bar.net" problem, but creates a vulnerability (what
>>> if we have both "foo.bar.net" and "foo.no-bar.net" in our hostfile?) that
>>> may or may not be acceptable (I'm sure it is at least uncommon for an MPI
>>> app to cross subnet boundaries, but maybe someone is really doing this in
>>> some rsh-based cluster).
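The first-field comparison and its weak spot can be sketched in a few lines of Python. This is only an illustration of the idea, not the matching code in Open MPI; the helper names are hypothetical, and lower-casing the name is an added assumption (DNS names are case-insensitive).

```python
def first_field(hostname):
    """Return the part of a hostname before the first dot, lower-cased."""
    return hostname.split(".", 1)[0].lower()

def group_by_first_field(hostnames):
    """Group names that a first-field comparison would treat as one node."""
    groups = {}
    for name in hostnames:
        groups.setdefault(first_field(name), []).append(name)
    return groups

names = ["foo", "foo.bar.net", "foo.no-bar.net", "baz.bar.net"]
for key, members in group_by_first_field(names).items():
    print(key, members)
# foo ['foo', 'foo.bar.net', 'foo.no-bar.net']
# baz ['baz.bar.net']
```

"foo" and "foo.bar.net" end up together as intended, but "foo.no-bar.net" is pulled into the same group even though it may be a different machine. A dotted IP address given in place of a name would also be cut at its first dot, so a real implementation would need to special-case those.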
>>> Or we could go back to fully resolving names provided via non-managed
>>> channels. Or we just tell people that "you *must* be consistent in how you
>>> identify nodes". Or....?
>>> Any input would be appreciated.
>>> devel mailing list