On 1/29/2014 11:30 PM, Ralph Castain wrote:
> On Jan 29, 2014, at 7:56 PM, Victor <victor.major_at_[hidden]
> <mailto:victor.major_at_[hidden]>> wrote:
>> Thanks for the insights Tim. I was aware that the CPUs will choke
>> beyond a certain point. From memory on my machine this happens with 5
>> concurrent MPI jobs with that benchmark that I am using.
>> My primary question was about scaling between the nodes. I was not
>> getting close to double the performance when running MPI jobs across
>> two 4 core nodes. It may be better now since I have Open-MX in place,
>> but I have not repeated the benchmarks yet since I need to get one
>> simulation job done asap.
> Some of that may be due to the expected loss of performance when you
> switch from shared memory to inter-node transports. While the point
> about saturating the memory path is valid, what you reported could be more
> consistent with that transition - i.e., it isn't unusual to see
> applications perform better when run on a single node, depending upon
> how they are written, up to a certain size of problem (which your code
> may not be hitting).
>> Regarding your mention of setting affinities and MPI ranks, do you
>> have specific (as in syntactically specific, since I am a novice and
>> easily confused...) examples of how I might set affinities to get
>> the Westmere node performing better?
> mpirun --bind-to-core -cpus-per-rank 2 ...
> will bind each MPI rank to 2 cores. Note that this will definitely
> *not* be a good idea if you are running more than two threads in your
> process - if you are, then set --cpus-per-rank to the number of
> threads, keeping in mind that you want things to break evenly across
> the sockets. In other words, if you have two 6 core/socket Westmeres
> on the node, then you want to run either 6 processes at cpus-per-rank=2
> if each process runs 2 threads, or 4 processes with cpus-per-rank=3 if
> each process runs 3 threads, or 2 processes with no cpus-per-rank but
> --bind-to-socket instead of --bind-to-core for any other thread count
> greater than 3.
> You would not want to run any other number of processes on the node or
> else the binding pattern will cause a single process to split its
> threads across the sockets - which will definitely hurt performance.
-cpus-per-rank 2 is an effective choice for this platform. As Ralph
said, it should work automatically for 2 threads per rank.
Ralph's point about not splitting a process across sockets is an
important one. Even splitting a process across internal buses, which
would happen with 3 threads per process, seems problematic.
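
For anyone who wants the rule of thumb above in a checkable form, here is a minimal Python sketch. It only encodes the process counts and options quoted in this thread for a node with two 6-core sockets. The function name suggest_layout and its parameters are hypothetical, the 1-thread case is an extrapolation of the same rule, and the option spellings are copied from the thread and may differ between Open MPI releases, so check your mpirun man page before relying on them.

```python
def suggest_layout(threads_per_proc, sockets=2, cores_per_socket=6):
    """Return (processes_per_node, mpirun_options) for one node.

    Hypothetical helper: it restates the rule of thumb from this thread
    and is not part of Open MPI.
    """
    if threads_per_proc < 1:
        raise ValueError("threads_per_proc must be at least 1")

    # If the thread count divides the cores on a socket evenly (and leaves
    # room for more than one process per socket), give each rank that many
    # cores so no process straddles two sockets.
    if (threads_per_proc < cores_per_socket
            and cores_per_socket % threads_per_proc == 0):
        procs = sockets * (cores_per_socket // threads_per_proc)
        opts = ["--bind-to-core", "--cpus-per-rank", str(threads_per_proc)]
        return procs, opts

    # Otherwise run one process per socket and bind at socket level.
    return sockets, ["--bind-to-socket"]


if __name__ == "__main__":
    for threads in (2, 3, 4):
        procs, opts = suggest_layout(threads)
        print(threads, "threads/process:", procs, "processes,", " ".join(opts))
```

The fallback branch deliberately keeps the process count equal to the number of sockets, since any other count would split some process's threads across sockets.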