I am getting some strange results when I enable the MCA parameters:
What happens is that for MPI programs that do a lot of synchronization
(MPI_Barrier and MPI_Wait), I get a very good speedup (2.x) by turning on
the parameter (e.g. for the CG benchmark of the NAS Parallel Benchmarks
suite).
I am not oversubscribing nodes: I am running 8 processes on an SMP system
with exactly 8 physical cores (each pair of cores shares a cache).
The only explanation I could come up with is thermal throttling, which
scales down the clock speed of the entire chip when all the cores get
too hot (because of the busy waiting). However, I tried to replicate the
behavior with a trivial (non-MPI) code in which one core does some work
while the others (on the same chip) busy-wait, and I didn't get the same
speedup when I switched from the busy-wait to the idle implementation.
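
A test of this kind can be sketched roughly as follows. This is only a
minimal sketch, not the exact code that was run: it assumes POSIX
threads, and run_trial, do_work, busy_waiter and idle_waiter are
hypothetical names made up for illustration. Pinning the threads to
cores of the same chip is left out and would have to be done separately.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_WAITERS 64   /* arbitrary upper bound for this sketch */

/* Shared state for one trial (all names are hypothetical). */
static atomic_int done;  /* set to 1 once the worker has finished */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;

/* Waiter that spins on the flag, keeping its core fully busy. */
static void *busy_waiter(void *arg) {
    (void)arg;
    while (!atomic_load(&done)) { /* spin */ }
    return NULL;
}

/* Waiter that blocks on a condition variable, leaving its core idle. */
static void *idle_waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mtx);
    while (!atomic_load(&done))
        pthread_cond_wait(&cv, &mtx);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Placeholder compute kernel: a dependent floating-point loop. */
static double do_work(long iters) {
    volatile double x = 1.0;
    for (long i = 0; i < iters; i++)
        x = x * 1.0000001 + 1e-9;
    return x;
}

/* Runs the work loop on the calling thread while nwaiters other threads
   either busy-wait (idle == 0) or sleep (idle != 0). Returns the
   elapsed wall-clock time of the work loop in seconds, or -1.0 if a
   thread could not be created. */
double run_trial(int idle, int nwaiters, long iters) {
    pthread_t t[MAX_WAITERS];
    struct timespec t0, t1;
    int started = 0;

    assert(nwaiters >= 0 && nwaiters <= MAX_WAITERS);
    atomic_store(&done, 0);
    for (; started < nwaiters; started++)
        if (pthread_create(&t[started], NULL,
                           idle ? idle_waiter : busy_waiter, NULL) != 0)
            break;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    do_work(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Release the waiters; the flag is set under the mutex so a
       sleeping waiter cannot miss the wakeup. */
    pthread_mutex_lock(&mtx);
    atomic_store(&done, 1);
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);

    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    if (started != nwaiters)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling run_trial(0, 7, iters) and run_trial(1, 7, iters) from main (built
with -pthread) and comparing the two times corresponds to the busy-wait
versus idle comparison described above.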
Does anyone have any idea why this is happening?