Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] busy waiting and oversubscriptions
From: Tim Prince (n8tm_at_[hidden])
Date: 2014-03-26 08:04:13


On 3/26/2014 6:45 AM, Andreas Schäfer wrote:
> On 10:27 Wed 26 Mar , Jeff Squyres (jsquyres) wrote:
>> Be aware of a few facts, though:
>>
>> 1. There is a fundamental difference between disabling
>> hyperthreading in the BIOS at power-on time and simply running one
>> MPI process per core. Disabling HT at power-on allocates more
>> hardware resources to the remaining HT that is left in each core
>> (e.g., deeper queues).
> Oh, I didn't know that. That's interesting! Do you have any links with
> in-depth info on that?
>
>
On certain Intel CPUs, the full-size instruction TLB was available to a
process only when HyperThreading was disabled in the BIOS setup menu, and
that was also the only way to make all the Write Combine buffers available
to a single process. Those CPUs are no longer in widespread use.

At one time, at Intel, we did a study to evaluate the net effect (on a
later CPU where this did not recover ITLB size). The result was buried
afterwards; possibly it didn't meet an unspecified marketing goal.
Typical applications ran 1% faster with HyperThreading disabled in the
BIOS menu than with HT left enabled, even when affinities were carefully
set to use just one process per core. Not all applications showed a loss
on all data sets when leaving HT enabled.
There are a few MPI applications with specialized threading which could
gain 10% or more by use of HT.

In my personal opinion, SMT becomes less interesting as the number of
independent cores increases.
Intel(r) Xeon Phi(tm) is an exception, as the vector processing unit
issues instructions from a single thread only on alternate cycles. This
capability is used more effectively by running OpenMP threads under MPI,
e.g. 6 ranks per coprocessor of 30 threads each, spread across 10 cores
per rank (exact optimum depending on the application; MKL libraries use
all available hardware threads for sufficiently large data sets).
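
To spell out the arithmetic in that example, here is a rough sketch. The
helper name phi_layout is made up for illustration and is not part of any
MPI or OpenMP API, and it assumes the threads of a rank are spread evenly
over that rank's cores.

```python
def phi_layout(ranks, threads_per_rank, cores_per_rank):
    """Summarize a hybrid MPI+OpenMP layout (hypothetical helper).

    Assumes each rank's threads are divided evenly among its cores.
    """
    if threads_per_rank % cores_per_rank != 0:
        raise ValueError("threads per rank must divide evenly across cores")
    return {
        "cores_used": ranks * cores_per_rank,
        "threads_per_core": threads_per_rank // cores_per_rank,
        "total_threads": ranks * threads_per_rank,
    }


# The figures quoted above: 6 ranks, 30 threads each, 10 cores per rank.
layout = phi_layout(ranks=6, threads_per_rank=30, cores_per_rank=10)
print(layout)
```

With those numbers the layout works out to 60 cores, 3 threads per core,
and 180 threads in total. As noted, the best split depends on the
application.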

-- 
Tim Prince