
Open MPI User's Mailing List Archives


From: Hugh Merz (merz_at_[hidden])
Date: 2006-11-08 09:57:18

On Wed, 8 Nov 2006, Larry Stewart wrote:
> Miguel Figueiredo Mascarenhas Sousa Filipe wrote:
>>>> the MPI model assumes you don't have a "shared memory" system..
>>>> therefore it is "message passing" oriented, and not designed to
>>>> perform optimally on shared memory systems (like SMPs, or numa-CCs).
>>> For many programs with both MPI and shared memory implementations, the
>>> MPI version runs faster on SMPs and numa-CCs. Why? See the previous
>>> paragraph...
>> Of course it does.. it's faster to copy data in main memory than it is
>> to do it through any kind of network interface. You can optimize your
>> message-passing implementation down to a couple of memory-to-memory
>> copies when ranks are on the same node. In the worst case, even if using
>> local IP addresses to communicate between peers/ranks (in the same
>> node), the operating system doesn't even touch the interface.. it
>> will just copy data from a TCP sender buffer to a TCP receiver
>> buffer.. in the end, that's always faster than going through a
>> physical network link.
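[The intra-node point above can be sketched with a quick stdlib-only comparison. This is an illustrative micro-benchmark, not an MPI measurement: a plain in-memory copy stands in for what a shared-memory transport reduces a send to, and a `socketpair` stands in for local TCP loopback; absolute timings will vary by machine.]

```python
import socket
import threading
import time

PAYLOAD = bytes(16 * 1024 * 1024)  # 16 MiB message, all zeros

# 1) Plain in-memory copy: roughly what an intra-node MPI transport
#    reduces a send to.
t0 = time.perf_counter()
copied = bytearray(PAYLOAD)
t_copy = time.perf_counter() - t0

# 2) Transfer through kernel socket buffers (socketpair stands in for
#    local TCP): sender buffer -> receiver buffer, no physical link.
a, b = socket.socketpair()
received = bytearray(len(PAYLOAD))

def reader():
    # Drain the socket into `received` until the full payload arrives.
    view = memoryview(received)
    while view:
        n = b.recv_into(view)
        view = view[n:]

t = threading.Thread(target=reader)
t.start()
t0 = time.perf_counter()
a.sendall(PAYLOAD)
t.join()
t_sock = time.perf_counter() - t0
a.close()
b.close()

print(f"in-memory copy: {t_copy * 1e3:.1f} ms, socket transfer: {t_sock * 1e3:.1f} ms")
```

Both paths stay in main memory; the socket path just pays extra system-call and buffering overhead, which is the gap a shared-memory transport avoids.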
> There are a lot of papers about the relative merits of a mixed
> shared-memory and MPI model - OpenMP on-node and MPI inter-node, for
> example. Generally they seem to show that MPI is at least as good.

The conventional wisdom that pure MPI is as good as hybrid models is driven primarily by the fact that people haven't had much incentive to re-write their algorithms to support both models. It's a lot easier to focus on MPI alone, hence the limited (and lightly tested) support for MPI_THREAD_MULTIPLE and asynchronous progress in Open MPI.

If current HPC trends continue, there is going to be increased motivation to implement fine-grained parallelism in addition to MPI. As an example, the amount of RAM/node doesn't seem to be increasing as fast as the number of cores/node, so pure MPI codes which use a significant amount of memory for buffers (domain decomposition algorithms are a good example) will not scale to as large a problem size as hybrid implementations in weak-scaling scenarios.
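[The buffer-memory argument can be made concrete with some back-of-the-envelope arithmetic. All numbers below are illustrative assumptions (cubic subdomains, 8 cores/node, 2-cell halos, double precision), not measurements from any real code.]

```python
# Halo-buffer memory for a 3-D domain decomposition under weak scaling:
# pure MPI (one rank per core) vs hybrid (one rank per node, threads
# sharing the node's subdomain).

def halo_bytes(side_cells, halo_width, bytes_per_cell):
    """Memory for the six face halos of a cubic subdomain."""
    return 6 * side_cells**2 * halo_width * bytes_per_cell

cores_per_node = 8
cells_per_core = 64**3   # weak scaling: fixed work per core
halo = 2                 # ghost-cell layers per face
b = 8                    # bytes per cell (double precision)

# Pure MPI: each of the 8 ranks on a node owns a 64^3 subdomain.
pure = cores_per_node * halo_bytes(64, halo, b)

# Hybrid: one rank owns all the node's cells; a 128^3 subdomain has a
# better surface-to-volume ratio, so less memory goes to halos.
node_side = round((cores_per_node * cells_per_core) ** (1 / 3))  # 128
hybrid = halo_bytes(node_side, halo, b)

print(f"pure-MPI / hybrid halo memory ratio: {pure / hybrid:.1f}")  # 2.0
```

In general the ratio is the cube root of the ranks per node, so as cores/node grows while RAM/node stalls, the pure-MPI halo overhead eats an ever larger share of the fixed memory budget.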