On Oct 30, 2012, at 9:51 AM, Hodge, Gary C wrote:
> FYI, recently, I was tracking down the source of page faults in our application that has real-time requirements. I found that disabling the sm component (--mca btl ^sm) eliminated many page faults I was seeing.
Good point. This is likely true; the shared memory (sm) component will definitely cause more page faults. Using huge pages may alleviate this (e.g., fewer pages means fewer TLB misses), but we haven't studied it much.
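For reference, excluding the sm BTL looks like this on the mpirun command line (the `^` prefix means "exclude"); the process count and executable name below are just placeholders:

```shell
# Exclude the shared-memory BTL; same-node traffic falls back to a
# network transport (e.g., tcp).
mpirun --mca btl ^sm -np 4 ./your_app

# Alternatively, explicitly list the transports to allow
# (the "self" BTL is needed for a process to send to itself):
mpirun --mca btl tcp,self -np 4 ./your_app
```

Either form keeps the rest of the MCA configuration untouched; only BTL selection changes.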
> I now have much better deterministic performance in that I no longer see outlier measurements (jobs that usually take 3 ms would sometimes take 15 ms).
I'm not sure I grok that; are you benchmarking an entire *job* (i.e., a single "mpirun") that varies between 3 and 15 milliseconds? If so, I'd say that both are pretty darn good, because mpirun incurs a lot of overhead launching and completing jobs. Furthermore, benchmarking an entire job that lasts significantly less than 1 second is probably not the most stable measurement, page faults aside -- there are lots of other distributed and OS effects that can cause a jump from 3 to 15 milliseconds.
> I did not notice a performance penalty using a network stack.
Depends on the app. Some MPI apps are latency bound; some are not.
Latency-bound applications will benefit from faster point-to-point performance, and shared memory will definitely have the fastest point-to-point latency of any transport (i.e., hundreds of nanoseconds vs. 1+ microsecond for a network stack).
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/