Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running on two nodes slower than running on one node
From: Victor (victor.major_at_[hidden])
Date: 2014-02-01 08:54:24

Thank you all for your help. --bind-to-core increased the cluster
performance by approximately 10%, so in addition to the improvements
through the implementation of Open-MX, the performance now scales within
expectations - not linear, but much better than with the original setup.

On 30 January 2014 20:43, Tim Prince <n8tm_at_[hidden]> wrote:

> On 1/29/2014 11:30 PM, Ralph Castain wrote:
> > On Jan 29, 2014, at 7:56 PM, Victor <victor.major_at_[hidden]> wrote:
> > > Thanks for the insights Tim. I was aware that the CPUs will choke beyond
> > > a certain point. From memory on my machine this happens with 5 concurrent
> > > MPI jobs with that benchmark that I am using.
> > > My primary question was about scaling between the nodes. I was not
> > > getting close to double the performance when running MPI jobs across two
> > > 4-core nodes. It may be better now since I have Open-MX in place, but I have
> > > not repeated the benchmarks yet since I need to get one simulation job done
> > > asap.
> >
> > Some of that may be due to expected loss of performance when you switch
> > from shared memory to inter-node transports. While it is true about
> > saturation of the memory path, what you reported could be more consistent
> > with that transition - i.e., it isn't unusual to see applications perform
> > better when run on a single node, depending upon how they are written, up
> > to a certain size of problem (which your code may not be hitting).
> >
> > > Regarding your mention of setting affinities and MPI ranks, do you have
> > > specific (as in syntactically specific since I am a novice and easily
> > > confused...) examples of how I may want to set affinities to get the
> > > Westmere node performing better?
> >
> > mpirun --bind-to-core -cpus-per-rank 2 ...
> >
> > will bind each MPI rank to 2 cores. Note that this will definitely *not*
> > be a good idea if you are running more than two threads in your process -
> > if you are, then set --cpus-per-rank to the number of threads, keeping in
> > mind that you want things to break evenly across the sockets. In other
> > words, if you have two 6 core/socket Westmeres on the node, then you
> > either want to run 6 processes at cpus-per-rank=2 if each process runs 2
> > threads, or 4 processes with cpus-per-rank=3 if each process runs 3
> > threads, or 2 processes with no cpus-per-rank but --bind-to-socket instead
> > of --bind-to-core for any other thread number > 3.
> > You would not want to run any other number of processes on the node or
> > else the binding pattern will cause a single process to split its threads
> > across the sockets - which will definitely hurt performance.
>
> -cpus-per-rank 2 is an effective choice for this platform. As Ralph
> said, it should work automatically for 2 threads per rank.
> Ralph's point about not splitting a process across sockets is an important
> one. Even splitting a process across internal busses, which would happen
> with 3 threads per process, seems problematical.
> --
> Tim Prince
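
The rule Ralph describes in the quoted text (pick a process count and binding so that no process has its threads split across sockets) can be sketched as a small calculation. This is only an illustration, not anything from Open MPI itself: the function name `plan_binding` and its parameters are hypothetical, the single-thread case is an assumption the thread does not cover, and the flag spellings are copied from the messages above and may differ in other Open MPI releases.

```python
def plan_binding(sockets, cores_per_socket, threads_per_process):
    """Suggest (processes_per_node, mpirun_flags) so that no process
    straddles a socket. Hypothetical helper, following the rule of
    thumb given in the thread."""
    if threads_per_process < 1:
        raise ValueError("threads_per_process must be at least 1")
    if threads_per_process > cores_per_socket:
        # A process with more threads than a socket has cores cannot
        # fit on one socket, so the rule cannot be satisfied.
        raise ValueError("process would have to span sockets")

    if threads_per_process == 1:
        # Assumption (not stated in the thread): one rank per core.
        return sockets * cores_per_socket, "--bind-to-core"

    divides_evenly = cores_per_socket % threads_per_process == 0
    if divides_evenly and threads_per_process < cores_per_socket:
        # Thread count breaks evenly across a socket: give each rank
        # exactly as many cores as it has threads.
        per_socket = cores_per_socket // threads_per_process
        flags = f"--bind-to-core --cpus-per-rank {threads_per_process}"
        return sockets * per_socket, flags

    # Any other thread count: one process per socket, bound to the socket.
    return sockets, "--bind-to-socket"


if __name__ == "__main__":
    # Two 6-core sockets, as in the Westmere example from the thread.
    for threads in (2, 3, 4):
        procs, flags = plan_binding(2, 6, threads)
        print(f"{threads} threads/process: {procs} processes, mpirun {flags} ...")
```

For the two-socket, 6-core-per-socket node discussed above, this reproduces the three cases given in the quoted message: 6 processes for 2 threads, 4 processes for 3 threads, and 2 socket-bound processes otherwise.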