Subject: Re: [OMPI users] EXTERNAL: Re: openmpi shared memory feature
From: Hodge, Gary C (gary.c.hodge_at_[hidden])
Date: 2012-10-30 11:46:25


Our measurements are not for the entire mpirun job; rather, they are for the time it takes to process a message through our processing pipeline, which consists of 11 processes distributed over 8 nodes. Taking an extra microsecond here and there is better for us than jumping from 3 to 15 ms, because that jump is when we cross a hard real-time limit.

-----Original Message-----
From: Jeff Squyres [mailto:jsquyres_at_[hidden]]
Sent: Tuesday, October 30, 2012 9:57 AM
To: Hodge, Gary C
Cc: Mahmood Naderan; Open MPI Users
Subject: Re: EXTERNAL: Re: [OMPI users] openmpi shared memory feature

On Oct 30, 2012, at 9:51 AM, Hodge, Gary C wrote:

> FYI, recently I was tracking down the source of page faults in our application, which has real-time requirements. I found that disabling the sm component (--mca btl ^sm) eliminated many of the page faults I was seeing.

Good point. This is likely true; the shared memory component will definitely cause more page faults. Using huge pages may alleviate this (e.g., less TLB usage), but we haven't studied it much.
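
For concreteness, a launch with the shared memory BTL disabled looks like the following (the hostfile and executable names here are placeholders, not from this thread):

    mpirun --mca btl ^sm --hostfile myhosts -np 11 ./pipeline_app

versus the default launch, which leaves the sm BTL enabled for ranks that share a node.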

> I now have much more deterministic performance, in that I no longer see outlier measurements (jobs that usually take 3 ms would sometimes take 15 ms).

I'm not sure I grok that; are you benchmarking an entire *job* (i.e., a single "mpirun") that varies between 3 and 15 milliseconds? If so, I'd say that both are pretty darn good, because mpirun incurs a lot of overhead for launching and completing jobs. Furthermore, benchmarking an entire job that lasts significantly less than 1 second is probably not the most stable measurement, page faults aside -- there are lots of other distributed-system and OS effects that can cause a jump from 3 to 15 milliseconds.
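
If the goal is to see per-message outliers rather than whole-job time, one approach (a minimal sketch, not from this thread; the 2-rank setup and iteration count are illustrative) is to time individual round trips inside a single run with MPI_Wtime and report the spread:

    /* latency_spread.c -- hypothetical sketch: time each ping-pong round
     * trip between two ranks, so job launch/teardown overhead is excluded
     * and per-message outliers become visible. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char buf = 0;
        double t, min = 1e9, max = 0.0, sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        for (i = 0; i < ITERS; i++) {
            MPI_Barrier(MPI_COMM_WORLD);
            t = MPI_Wtime();
            if (rank == 0) {          /* one round trip: send then receive */
                MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {                  /* mirror: receive then send back */
                MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
            t = MPI_Wtime() - t;      /* elapsed time for this round trip */
            if (t < min) min = t;
            if (t > max) max = t;
            sum += t;
        }

        if (rank == 0)
            printf("round trip: min %.1f us  avg %.1f us  max %.1f us\n",
                   min * 1e6, (sum / ITERS) * 1e6, max * 1e6);

        MPI_Finalize();
        return 0;
    }

Compiling with mpicc and running the same binary with and without --mca btl ^sm then compares the two paths without the job-startup noise folded in.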

> I did not notice a performance penalty using a network stack.

Depends on the app. Some MPI apps are latency bound; some are not.

Latency-bound applications will definitely benefit from faster point-to-point performance, and shared memory has the lowest point-to-point latency of any transport (i.e., hundreds of nanoseconds vs. a microsecond or more for a network stack).
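
One way to see that gap directly is to run a two-rank ping-pong (for example the timing sketch above, or a standard microbenchmark such as osu_latency) once with both ranks on the same node and once with the ranks on different nodes. The hostnames below are placeholders, and the exact --host syntax varies somewhat across Open MPI versions:

    mpirun -np 2 --host node01,node01 ./latency_spread   # same node: sm path
    mpirun -np 2 --host node01,node02 ./latency_spread   # two nodes: network path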

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/