On 12/11/2011 12:16 PM, Andreas Schäfer wrote:
> on an SMP box threaded codes CAN always be faster than their MPI
> equivalents. One reason why MPI sometimes turns out to be faster is
> that with MPI every process actually initializes its own
> data. Therefore it'll end up in the NUMA domain to which the core
> running that process belongs. A lot of threaded codes are not NUMA
> aware. So, for instance the initialization is done sequentially
> (because it may not take a lot of time), and Linux' first touch policy
> makes all memory pages belong to a single domain. In essence, those
> codes will use just a single memory controller (and its bandwidth).
Many applications require significant additional RAM and message-passing
communication per MPI rank. Where those are not adverse issues, MPI is
likely to outperform pure OpenMP (Andreas just cited some of the
reasons), and OpenMP is likely to be favored only where it is the easier
development model. The OpenMP runtime should also implement a
first-touch policy, but it's very difficult to carry out fully in legacy
codes.

Open MPI has had effective shared-memory message passing from the
beginning, as did its predecessor (LAM) and all current commercial MPI
implementations I have seen, so you shouldn't have to beat on an issue
which was dealt with 10 years ago. If you haven't been watching this
mailing list, you've missed some impressive reporting of new support
features for effective pinning by CPU, cache, etc.
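For example, pinning can be requested on the mpirun command line; the
exact option names vary by Open MPI version, and ./myapp is a
placeholder for your binary:

```shell
# Pin one rank per core and print the resulting bindings
# (Open MPI 1.4/1.5-era syntax; newer releases spell this
#  "--bind-to core --map-by core" instead).
mpirun -np 16 --bind-to-core --report-bindings ./myapp

# Bind each rank to a socket, so it can spread its threads
# across that socket's cores (useful for hybrid runs).
mpirun -np 4 --bysocket --bind-to-socket ./myapp
```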

When you get to hundreds of nodes, depending on your application and
interconnect performance, you may need to consider a "hybrid" approach
(OpenMP as the threading model within each rank, with MPI run in
MPI_THREAD_FUNNELED mode), if you are running a single application
across the entire cluster.

The biggest cluster in my neighborhood, which ranked #54 on the recent
Top500, gave best performance in pure MPI mode for that ranking. It
uses FDR InfiniBand, and ran 16 ranks per node on 646 nodes, with
DGEMM running in 4-wide vector parallel. Hybrid was tested as well,
with each multiple-thread rank pinned to a single L3 cache.
All 3 MPI implementations which were tested (Open MPI and 2 commercial
MPIs) have full shared-memory message passing and pinning to local
cache within each node.