On Mon, Jul 13, 2009 at 01:24:54PM -0400, Mark Borgerding wrote:
> Here's my advice: Don't trust anyone's advice. Benchmark it yourself and see.
> The problems vary so wildly that only you can tell if your problem will
> benefit from over-subscription. It really depends on too many factors to
> accurately predict: schedulers, memory usage, network/interconnect
> hardware, disk seek times, and probably a hundred other things.
> I've even seen mixed results from oversubscribing within a single
> algorithm. (Granted, this is mostly with the older generation of
> hyperthreading, so I'm not sure how things fare with Nehalem.) The most
> notable effect I've observed is related to cache use. If the problem
> fits in cache it is much faster. With cores sharing cache it can even
> be advantageous to *undersubscribe* the problem. i.e. schedule 2
> processes on a quad core so each can have the full cache.
Mark's advice is stellar: "Benchmark it yourself and see."
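A benchmark-it-yourself harness can be as small as the sketch below: time the same workload at under-, full, and over-subscription and compare. The workload, worker counts, and task count here are placeholders, not a recipe -- substitute something representative of your own problem.

```python
# Minimal "benchmark it yourself" sketch: run the same set of tasks
# with different numbers of worker processes and time each run.
import time
from multiprocessing import Pool

def workload(_):
    # Placeholder compute kernel; swap in your real kernel here.
    s = 0
    for i in range(200_000):
        s += i * i
    return s

def time_at(nworkers, ntasks=8):
    # Wall-clock time to finish ntasks with nworkers processes.
    start = time.perf_counter()
    with Pool(nworkers) as pool:
        pool.map(workload, range(ntasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    # On a quad core: undersubscribed, fully subscribed, oversubscribed.
    for n in (2, 4, 8):
        print(f"{n} workers: {time_at(n):.3f} s")
```

Which worker count wins depends on exactly the factors listed below, which is the point of measuring rather than predicting.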
I suspect that a number of interesting things are hidden under:
- application chunk sizing
- application chunk symmetry
- cache interactions
- cache line conflicts
- MPI primitives
- MPI message rate interactions
- MPI bandwidth interactions
- MPI latency interactions
- barrier code used in MPI primitives
- mutex code
- communication hardware interactions
- compiler optimizations
- compiler pipelining
- compiler flags
- compiler loop unrolling
- compiler SIMD instruction use
- compiler intrinsics
- library selection and implementation
- system API choice
- hardware pipeline use while hyperthreading is active
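Several of those knobs come down to where processes land on the cores. On Linux, Mark's "undersubscribe so each process gets the full cache" experiment can be sketched with process affinity -- this is only a sketch, and which logical CPUs actually share a cache is machine-specific, so check your topology (e.g. with lscpu) before reading much into the numbers:

```python
# Pin each worker process to its own logical CPU so each gets an
# un-shared slice of whatever cache that CPU owns.
import os
from multiprocessing import Process

def pinned_worker(core):
    # Restrict the calling process to a single logical CPU.
    os.sched_setaffinity(0, {core})
    return sum(i * i for i in range(100_000))  # placeholder workload

if __name__ == "__main__":
    # First two available CPUs; on a real run, pick CPUs that do NOT
    # share a cache (or a hyperthreaded core pair) on your machine.
    cores = sorted(os.sched_getaffinity(0))[:2]
    procs = [Process(target=pinned_worker, args=(c,)) for c in cores]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("exit codes:", [p.exitcode for p in procs])
```

Timing the pinned and unpinned variants against each other is one way to see the cache effect Mark describes, if your problem has one.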
Even a naive view of Intel Hyperthreading transistor counts shows why
it is economical to share some pipelines between two execution
streams: with old code and common application mixes, fully replicated
hardware would sit idle a lot of the time.
At PathScale, MPI benchmarks comparing the in-house compiler with
other modern optimizing compilers were not run with hyperthreading
enabled, because it was routinely slower on the benchmarks we cared
about (and enabling it required a BIOS change). YMMV: what is
interesting to you might be different, so try it.
AND at a system level hyperthreading is very interesting, because
things like IO, X, and numerous kernel tasks do not need or touch the
big blocks of shared transistors that make up the floating point
hardware.
T o m M i t c h e l l
Found me a new hat, now what?