I am evaluating the performance of a clustering program written in Java
with MPI+threads and would like some insight into a peculiar case.
I've attached a performance graph to explain it.
In essence, the tests were carried out as TxPxN, where T is the number of
threads per process, P is the number of processes per node, and N is the
number of nodes. I noticed an inefficiency with the Tx*1*xN cases in
general (the tall bars in the graph).
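To make the TxPxN notation concrete, here is a minimal sketch in Java. The class RunShape and its methods are hypothetical names used only for illustration and are not part of the clustering program. The node count of 4 is an assumed value, and the loop assumes T times P equals the 8 cores of a node, which may not hold for every bar in the graph.

```java
// Hypothetical helper that spells out the TxPxN notation; it is not part of
// the clustering program itself.
final class RunShape {
    final int threadsPerProcess; // T
    final int processesPerNode;  // P
    final int nodes;             // N

    RunShape(int threadsPerProcess, int processesPerNode, int nodes) {
        this.threadsPerProcess = threadsPerProcess;
        this.processesPerNode = processesPerNode;
        this.nodes = nodes;
    }

    // Threads running on a single node: T * P.
    int threadsPerNode() { return threadsPerProcess * processesPerNode; }

    // MPI processes across the whole run: P * N.
    int totalProcesses() { return processesPerNode * nodes; }

    // Threads across the whole run: T * P * N.
    int totalThreads() { return threadsPerNode() * nodes; }

    String label() { return threadsPerProcess + "x" + processesPerNode + "x" + nodes; }

    public static void main(String[] args) {
        int coresPerNode = 8; // from the post: 2 sockets x 4 cores
        int nodes = 4;        // assumed value, purely for illustration
        // Assumption: T * P fills the 8 cores of a node exactly.
        for (int t = 1; t <= coresPerNode; t *= 2) {
            RunShape shape = new RunShape(t, coresPerNode / t, nodes);
            System.out.println(shape.label() + ": " + shape.totalProcesses()
                    + " MPI processes, " + shape.totalThreads() + " threads in total");
        }
    }
}
```

Under these assumptions, the Tx*1*xN cases are the ones with a single MPI process per node, so all parallelism inside a node comes from threads.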
To elaborate a bit further:
1. Each node has 2 sockets with 4 cores each (8 cores in total).
2. I used OpenMPI 1.7.5rc5 (I later tested with 1.8 and observed the same).
3. I tried the following options:
   A.) --map-by node:PE=4 and --bind-to core
   B.) --map-by node:PE=8 and --bind-to-core
   C.) --map-by socket and --bind-to none
The timings came out as A < B < C, so I used the results from option A for
the Tx*1*xN cases in the graph.
Could you please suggest anything that may help speed up these Tx*1*xN
cases? Also, I expected B to perform better than A, since the threads could
use all 8 cores, but that wasn't the case.
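The arithmetic behind my expectation for A versus B can be sketched as follows. This is only a rough model, assuming that PE=n binds each process to n cores, that the bound cores are assigned contiguously starting at a socket boundary, and that the process runs T=8 threads (an assumed value). BindingSketch and its methods are hypothetical names, not anything from OpenMPI or my program.

```java
// Hypothetical model of how many threads share each bound core under
// options A (PE=4) and B (PE=8); it does not query the real hardware.
final class BindingSketch {
    static final int CORES_PER_SOCKET = 4; // from the node description above

    // Average number of threads sharing one bound core when a process with
    // the given thread count is bound to pe cores.
    static double threadsPerCore(int threads, int pe) {
        return (double) threads / pe;
    }

    // Minimum number of sockets the bound cores cover, assuming the cores
    // are contiguous and start at a socket boundary.
    static int minSocketsSpanned(int pe) {
        return (pe + CORES_PER_SOCKET - 1) / CORES_PER_SOCKET;
    }

    public static void main(String[] args) {
        int threads = 8; // assumed T for a Tx1xN run
        for (int pe : new int[] {4, 8}) {
            System.out.println("PE=" + pe + ": " + threadsPerCore(threads, pe)
                    + " threads per bound core, at least "
                    + minSocketsSpanned(pe) + " socket(s)");
        }
    }
}
```

Under these assumptions, option A shares each bound core between two threads on one socket, while option B gives each thread its own core but spans both sockets. That difference in socket span is one thing that separates A from B, though I have not confirmed that it explains the timings.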
[image: Inline image 1]
Saliya Ekanayake esaliya_at_[hidden]
Cell 812-391-4914 Home 812-961-6383