On 1/29/2014 11:30 PM, Ralph Castain wrote:
> On Jan 29, 2014, at 7:56 PM, Victor <victor.major_at_[hidden]
> <mailto:victor.major_at_[hidden]>> wrote:
>> Thanks for the insights Tim. I was aware that the CPUs will choke
>> beyond a certain point. From memory on my machine this happens with 5
>> concurrent MPI jobs with that benchmark that I am using.
>> My primary question was about scaling between the nodes. I was not
>> getting close to double the performance when running MPI jobs across
>> two 4 core nodes. It may be better now since I have Open-MX in place,
>> but I have not repeated the benchmarks yet since I need to get one
>> simulation job done asap.
> Some of that may be due to expected loss of performance when you
> switch from shared memory to inter-node transports. While it is true
> about saturation of the memory path, what you reported could be more
> consistent with that transition - i.e., it isn't unusual to see
> applications perform better when run on a single node, depending upon
> how they are written, up to a certain size of problem (which your code
> may not be hitting).
>> Regarding your mention of setting affinities and MPI ranks, do you
>> have specific (as in syntactically specific, since I am a novice and
>> easily confused...) examples of how I might set affinities to get
>> the Westmere node performing better?
> mpirun --bind-to-core -cpus-per-rank 2 ...
> will bind each MPI rank to 2 cores. Note that this will definitely
> *not* be a good idea if you are running more than two threads in your
> process - if you are, then set --cpus-per-rank to the number of
> threads, keeping in mind that you want things to break evenly across
> the sockets. In other words, if you have two 6-core/socket Westmeres
> on the node, then you either want to run 6 processes at cpus-per-rank=2
> if each process runs 2 threads, or 4 processes with cpus-per-rank=3 if
> each process runs 3 threads, or 2 processes with no cpus-per-rank but
> --bind-to-socket instead of --bind-to-core for any thread count
> greater than 3.
> You would not want to run any other number of processes on the node or
> else the binding pattern will cause a single process to split its
> threads across the sockets - which will definitely hurt performance.
-cpus-per-rank 2 is an effective choice for this platform. As Ralph
said, it should work automatically for 2 threads per rank.
Ralph's point about not splitting a process across sockets is an
important one. Even splitting a process across internal buses, which
would happen with 3 threads per process, seems problematic.
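For concreteness, the layouts Ralph describes could be sketched as a small shell snippet. This is only an illustration: the flag spellings (--bind-to-core, -cpus-per-rank, --bind-to-socket) are the ones quoted above from the Open MPI 1.x-era mpirun, and the application name ./my_app is a placeholder. The snippet just derives the rank count for a 2-socket, 6-core/socket node and prints the command without running it.

```shell
#!/bin/sh
# Assumed hardware: two 6-core Westmere sockets, 12 cores total.
TOTAL_CORES=12
THREADS_PER_RANK=2   # set to 3 for the 4-rank layout described above

# One core per thread, so ranks = total cores / threads per rank.
NP=$((TOTAL_CORES / THREADS_PER_RANK))

# Print (rather than execute) the mpirun line for this layout.
echo "mpirun --bind-to-core -cpus-per-rank $THREADS_PER_RANK -np $NP ./my_app"

# For more than 3 threads per rank, bind whole sockets instead:
# mpirun --bind-to-socket -np 2 ./my_app
```

With THREADS_PER_RANK=2 this prints the 6-rank layout; with 3 it prints the 4-rank layout, matching the even split across sockets that Ralph recommends.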