We move 40K and 160K messages from process to process on the same node. Our app calls mlockall(MCL_CURRENT | MCL_FUTURE) before MPI_INIT.
I measure the page faults using getrusage and record when they increase. I observe increasing ru_minflt values and no ru_majflt increase.
The increments reported are 40, 80, or 120; our page size is 4K. The page reclaims/faults are checked after MPI receive processing,
after our application processing, and after MPI send processing. Our application processing is not the source of increasing reclaims/faults.
I observe the disk I/O light flashing on nodes when we report increasing reclaims/faults.
When I turn off the SM BTL, the reclaims stop increasing and the disk I/O light does not blink.
From: George Bosilca [mailto:bosilca_at_[hidden]]
Sent: Thursday, November 01, 2012 12:25 AM
To: Open MPI Users
Cc: Hodge, Gary C
Subject: Re: [OMPI users] EXTERNAL: Re: openmpi shared memory feature
On Oct 30, 2012, at 09:57 , Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Oct 30, 2012, at 9:51 AM, Hodge, Gary C wrote:
>> FYI, recently, I was tracking down the source of page faults in our application that has real-time requirements. I found that disabling the sm component (--mca btl ^sm) eliminated many page faults I was seeing.
> Good point. This is likely true; the shared memory component will definitely cause more page faults. Using huge pages may alleviate this (e.g., less TLB usage), but we haven't studied it much.
This will depend on the communication pattern of the application and the size of the messages. A rise in the number of page faults is not normal behavior, and it is unexpected in most common execution scenarios. We reuse the memory pages in the SM BTL, minimizing the page faults as well as the TLB misses.
If the sharp increase in the number of page faults is indeed to be blamed on the SM BTL, this is more than worrisome, as it might indicate a wrong usage of the reserved memory pages (like a FIFO instead of a LIFO). Can you provide us with more precise information regarding this, please?
>> I now have much better deterministic performance in that I no longer see outlier measurements (jobs that usually take 3 ms would sometimes take 15 ms).
> I'm not sure I grok that; are you benchmarking an entire *job* (i.e., a single "mpirun") that varies between 3 and 15 milliseconds? If so, I'd say that both are pretty darn good, because mpirun invokes a lot of overhead for launching and completing jobs. Furthermore, benchmarking an entire job that lasts significantly less than 1 second is probably not the most stable measurement, regardless of page faults or not -- there's lots of other distributed and OS effects that can cause a jump from 3 to 15 milliseconds.
>> I did not notice a performance penalty using a network stack.
> Depends on the app. Some MPI apps are latency bound; some are not.
> Latency-bound applications will definitely benefit from faster point-to-point performance. Shared memory will definitely have the fastest point-to-point latency compared to any network stack (i.e., hundreds of nanos vs. 1+ micro).
> Jeff Squyres