On Oct 30, 2012, at 09:57, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Oct 30, 2012, at 9:51 AM, Hodge, Gary C wrote:
>> FYI, recently, I was tracking down the source of page faults in our application that has real-time requirements. I found that disabling the sm component (--mca btl ^sm) eliminated many page faults I was seeing.
> Good point. This is likely true; the shared memory component will definitely cause more page faults. Using huge pages may alleviate this (e.g., less TLB usage), but we haven't studied it much.
This will depend on the communication pattern of the application and the size of the messages. A rise in the number of page faults is not normal behavior, and it is unexpected in most common execution scenarios. We reuse the memory pages in the SM BTL, which minimizes page faults as well as TLB misses.
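One way to check whether page faults really rise is to count them around the phase of interest. The sketch below is only an illustration and not Open MPI code: it reads the process's minor-fault counter with `getrusage` before and after touching freshly mapped pages. The helper names `minor_faults` and `faults_while_touching` are made up for this example, and it assumes a Unix-like system where the `resource` module is available.

```python
import mmap
import resource


def minor_faults():
    """Return the minor page faults this process has taken so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def faults_while_touching(num_pages):
    """Touch one byte in each page of a fresh anonymous mapping.

    Returns the change in the minor-fault counter across the loop.
    """
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, num_pages * page)  # anonymous mapping, not yet resident
    before = minor_faults()
    for i in range(num_pages):
        buf[i * page] = 1  # first write to a page normally faults it in
    after = minor_faults()
    buf.close()
    return after - before


if __name__ == "__main__":
    print("minor faults while touching 64 pages:", faults_while_touching(64))
```

The same before/after pattern could be wrapped around the communication phase of the application, once with the sm BTL enabled and once with it disabled, to see where the faults come from.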
If the sharp increase in the number of page faults is indeed to be blamed on the SM BTL, this is more than worrisome, as it might indicate a wrong usage of the reserved memory pages (like a FIFO instead of a LIFO). Can you provide us with more precise information regarding this, please?
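To illustrate why the reuse order matters, here is a toy model (not the SM BTL implementation) of a pool of buffers, each assumed to sit on its own page. It counts how many distinct buffers a simple acquire/release loop touches under each policy. The function name `distinct_buffers_touched` and the pool and message sizes are assumptions for this example.

```python
from collections import deque


def distinct_buffers_touched(policy, pool_size=8, messages=100):
    """Count the distinct buffers used by an acquire/release loop.

    policy is "lifo" (reuse the most recently released buffer) or
    "fifo" (reuse the buffer that has been free the longest).
    """
    free = deque(range(pool_size))  # hypothetical pool of page-sized buffers
    touched = set()
    for _ in range(messages):
        # LIFO takes from the same end we release to; FIFO takes from the other end.
        buf = free.pop() if policy == "lifo" else free.popleft()
        touched.add(buf)
        free.append(buf)  # release the buffer back to the pool
    return len(touched)


if __name__ == "__main__":
    print("lifo:", distinct_buffers_touched("lifo"))  # prints "lifo: 1"
    print("fifo:", distinct_buffers_touched("fifo"))  # prints "fifo: 8"
```

With LIFO reuse the loop keeps hitting the same buffer, so the working set stays at one page. With FIFO reuse it cycles through the whole pool, so every page in the pool gets touched, which is the kind of pattern that could inflate page faults and TLB misses.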
>> I now have much better deterministic performance in that I no longer see outlier measurements (jobs that usually take 3 ms would sometimes take 15 ms).
> I'm not sure I grok that; are you benchmarking an entire *job* (i.e., a single "mpirun") that varies between 3 and 15 milliseconds? If so, I'd say that both are pretty darn good, because mpirun invokes a lot of overhead for launching and completing jobs. Furthermore, benchmarking an entire job that lasts significantly less than 1 second is probably not the most stable measurement, regardless of page faults or not -- there's lots of other distributed and OS effects that can cause a jump from 3 to 15 milliseconds.
>> I did not notice a performance penalty using a network stack.
> Depends on the app. Some MPI apps are latency bound; some are not.
> Latency-bound applications will definitely benefit from faster point-to-point performance. Shared memory will definitely have the fastest point-to-point latency compared to any network stack (i.e., hundreds of nanos vs. 1+ micro).
> Jeff Squyres