Interesting data. Couple of quick points that might help:
Option B is equivalent to --map-by node --bind-to none. When you bind to every core on the node, we don't bind you at all, since "bind to all" is exactly equivalent to "bind to none". So it will definitely run slower, as the threads run across the different NUMA regions on the node.
You might also want to try --map-by socket, with no binding directive. This would map one process to each socket, binding it to the socket - which is similar to what your option A actually accomplished. The only difference is that the procs that share a node will differ in rank by 1, whereas option A would have those procs differ in rank by N. Depending on your communication pattern, this could make a big difference.
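To make the rank-numbering difference concrete, here is a minimal Python sketch. It is only a simplified model of the placement, not Open MPI's actual mapper: it assumes one rank per socket, 2 sockets per node as in your setup, and that by-node mapping deals ranks out round-robin across the nodes. The function names are made up for illustration.

```python
# Simplified placement model (an assumption, not Open MPI's real mapper code).

def map_by_node(num_nodes, procs_per_node):
    """Round-robin across nodes: rank r lands on node r % num_nodes."""
    total = num_nodes * procs_per_node
    return {rank: rank % num_nodes for rank in range(total)}


def map_by_socket(num_nodes, sockets_per_node):
    """One rank per socket, filling all sockets of a node before moving on."""
    total = num_nodes * sockets_per_node
    return {rank: rank // sockets_per_node for rank in range(total)}


def ranks_on_node(placement, node):
    """Return the sorted list of ranks that the model placed on a given node."""
    return sorted(r for r, n in placement.items() if n == node)


if __name__ == "__main__":
    N = 4  # hypothetical node count, chosen only for the example
    print(ranks_on_node(map_by_node(N, 2), 0))    # [0, 4]  -> ranks differ by N
    print(ranks_on_node(map_by_socket(N, 2), 0))  # [0, 1]  -> ranks differ by 1
```

If your code mostly exchanges data between neighboring ranks, the second layout keeps that traffic on the node.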
Mapping by socket typically gives the best performance for threaded apps. You generally don't want P=1 unless you have a *lot* of threads in the process, as it removes any use of shared memory between ranks, and so messaging will run slower. You also want the ranks that share a node to be the ones that communicate with each other most frequently, if you can identify them.
On Apr 10, 2014, at 5:59 PM, Saliya Ekanayake <esaliya_at_[hidden]> wrote:
> I am evaluating the performance of a clustering program written in Java with MPI+threads and would like to get some insight into solving a peculiar case. I've attached a performance graph to explain this.
> In essence the tests were carried out as TxPxN, where T is threads per process, P is processes per node, and N is number of nodes. I noticed an inefficiency with Tx1xN cases in general (tall bars in graph).
> To elaborate a bit further,
> 1. each node has 2 sockets with 4 cores each (totaling 8 cores)
> 2. used OpenMPI 1.7.5rc5 (later tested with 1.8 and observed the same)
> 3. with options
> A.) --map-by node:PE=4 and --bind-to core
> B.) --map-by node:PE=8 and --bind-to-core
> C.) --map-by socket and --bind-to none
> Timing of A, B, C came out as A < B < C, so I used the results from option A for Tx1xN in the graph.
> Could you please give some suggestion that may help to speed up these Tx1xN cases? Also, I expected B to perform better than A as threads could utilize all 8 cores, but it wasn't the case.
> Thank you,
> Saliya Ekanayake esaliya_at_[hidden]
> Cell 812-391-4914 Home 812-961-6383