Open MPI Development Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-07-20 10:31:02


On 7/20/07 8:13 AM, "Rolf.Vandevaart_at_[hidden]" <Rolf.Vandevaart_at_[hidden]>
wrote:

>
> Greetings:
> Ralph brings up some good points here. I have a few thoughts/experiences.
> First, I like the way things are behaving now. In fact, I take full
> advantage of the fact that different aliases for a node are treated as
> different nodes to do some scalability testing. This is how I fake out
> ORTE and have it start multiple daemons on a node. (We had a similar
> feature in our old ClusterTools runtime environment to get multiple
> daemons running on a single node.)
>
> For example, I do this to get 4 orteds running on "alachua".
>
> mpirun -np 4 -host alachua,alachua-1,alachua-2,alachua-3 hostname
>
> All of the above resolve to the same IP address.
>
> Secondly, I would not want us to make any change that negatively affects
> scalability. If we do decide to make a change, then we need a flag to
> revert to the original behaviour.
>
> Lastly, I guess I have two questions.
> 1. Are you sure that Open MPI behaves in "unexpected ways?" This all
> worked fine for me as I stated above.

Here's the problem. The system sends its launch message to every daemon.
That message specifies what exec to run on each *node*, not what exec each
daemon should run, because we only map to nodes and expect only one daemon
per node. So if you have multiple daemons sitting on the same node, each of
them will launch a copy of the procs for that node.

In your case, hostname couldn't care less so it isn't a problem. However,
for an MPI job, you would now have multiple procs sharing the same name -
which causes havoc.
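
To make the failure mode concrete, here is a toy model of it. This is only
a sketch, not ORTE's actual data structures: the launch message is modeled
as a dict keyed by node name, and the names launch, duplicate_names,
nodeA, daemon0/daemon1 and rank0/rank1 are all made up for illustration.

```python
from collections import Counter

def launch(launch_msg, daemons):
    """Every daemon starts all procs the message lists for its node.

    launch_msg: dict mapping node name -> list of proc names (hypothetical).
    daemons: list of (daemon_id, node_name) pairs.
    """
    started = []
    for daemon_id, node in daemons:
        for proc_name in launch_msg.get(node, []):
            started.append((daemon_id, proc_name))
    return started

def duplicate_names(started):
    """Return the proc names that were started more than once."""
    counts = Counter(name for _, name in started)
    return sorted(name for name, n in counts.items() if n > 1)

# The map only knows about nodes, so it has a single entry for nodeA.
launch_msg = {"nodeA": ["rank0", "rank1"]}

one_daemon = [("daemon0", "nodeA")]
two_daemons = [("daemon0", "nodeA"), ("daemon1", "nodeA")]

print(duplicate_names(launch(launch_msg, one_daemon)))
print(duplicate_names(launch(launch_msg, two_daemons)))
```

With one daemon per node no name is repeated. With two daemons on nodeA,
both rank0 and rank1 are started twice, which is the shared-name situation
described above.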

> 2. Do you have any more details on the cost of "resolving every name"?
> Which API is it that causes the problems? I only ask because I have
> been trying to understand some of the NIS traffic I see when running
> on my cluster.

I honestly don't recall details at the moment - it was a couple of years ago
when we last tried that option. If I recall correctly (Jeff's infamous
IIRC), it involved doing a dns_lookup on every hostname, which meant that
the HNP was banging away on your local DNS server. This would take a few
seconds for a few tens of nodes due to traffic contention at the DNS server
on some of the clusters we were using at the time, so there was concern over
scalability.

But I may be mis-remembering. If someone can/wants to run a quick test code
to measure the time required, that might be useful info. My guess, though,
is that this might have scaling issues. Again, we could require this only in
specific cases - maybe when we have -host specified? Just fishing here...
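
A quick test along those lines could look like the sketch below. It is an
assumption that the system resolver (reached here through Python's
socket.getaddrinfo) is a fair stand-in for whatever lookup call was used
back then. The helper names resolve and time_lookups are made up, and the
host list is a placeholder to be replaced with real node names.

```python
import socket
import time

def resolve(hostname):
    """Resolve one hostname through the system resolver (DNS, NIS, hosts file)."""
    return socket.getaddrinfo(hostname, None)

def time_lookups(hostnames, resolver=resolve):
    """Resolve names one after another, as a single launcher process would.

    Returns a dict: hostname -> (elapsed_seconds, succeeded).
    """
    timings = {}
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolver(name)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        timings[name] = (time.perf_counter() - start, ok)
    return timings

if __name__ == "__main__":
    hosts = ["localhost"]  # placeholder: substitute the names from a hostfile
    results = time_lookups(hosts)
    for name, (elapsed, ok) in results.items():
        status = "ok" if ok else "FAILED"
        print(f"{name}: {elapsed * 1000:.2f} ms ({status})")
    total = sum(elapsed for elapsed, _ in results.values())
    print(f"total for {len(hosts)} names: {total * 1000:.2f} ms")
```

The lookups are deliberately done serially, since the concern is one
process resolving every name. Resolver caching will hide most of the cost
on a second run, so the first run on a cold cache is the number to look at.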

>
> Thanks,
> Rolf
>
>
> Ralph Castain wrote:
>
>> Yo all
>>
>> A recent email thread on the devel list involved (in part) the question of
>> hostname resolution. [Note: I have a fix for the localhost problem described
>> in that thread - just need to chase down a memory corruption problem, so it
>> won't come into the trunk until next week]
>>
>> This is a problem that has troubled us since the beginning, and we have gone
>> back-and-forth on solutions. Rather than just throwing another code change
>> into the system, Jeff and I thought it might be a good idea to seek input
>> from the community.
>>
>> The problem is that our system requires a consistent way of identifying
>> nodes so we can tell if, for example, we already have a daemon on that node.
>> We currently do that via a string hostname. This appears to work just fine
>> in managed environments as the allocators are (usually?) consistent in how
>> they name a node.
>>
>> However, users are frequently not consistent, which causes a problem. For
>> example, users can create a hostfile entry for "foo.bar.net", and then put
>> "-host foo" on their command line. In Open MPI, these will be treated as two
>> completely separate nodes.
>>
>> In the past, we attempted to solve this by actually resolving every name
>> provided to us. However, resolving names of remote hosts can be a very
>> expensive function call, especially at scale. One solution we considered was
>> to only do this for non-managed environments - i.e., when provided names in
>> a hostfile or via -host. This was rejected on the grounds that it penalized
>> people who used those mechanisms and, in many cases, wasn't necessary
>> because users were careful to avoid ambiguity.
>>
>> But that leaves us with an unsolved problem that can cause Open MPI to
>> behave in unexpected ways, including possibly hanging. Of course, we could
>> just check names for matches in that first network name field - this would
>> solve the "foo" vs "foo.bar.net" problem, but creates a vulnerability (what
>> if we have both "foo.bar.net" and "foo.no-bar.net" in our hostfile?) that
>> may or may not be acceptable (I'm sure it is at least uncommon for an MPI
>> app to cross subnet boundaries, but maybe someone is really doing this in
>> some rsh-based cluster).
>>
>> Or we could go back to fully resolving names provided via non-managed
>> channels. Or we just tell people that "you *must* be consistent in how you
>> identify nodes". Or....?
>>
>> Any input would be appreciated.
>> Ralph
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>