I have a follow-up question on this. Our application has parallel for loops similar to OMP parallel for. I noticed that to gain any speedup from threads I have to set --bind-to none; otherwise multiple threads bind to the same core and performance does not improve. For example, I get the attached performance for a simple 3-point stencil computation run with T threads on 1 MPI process on 1 node (Tx1x1).
My understanding is that even when there are multiple processes per node, we should use --bind-to none to get performance from threads. Is this correct? Also, what is the disadvantage of not using --bind-to core?
Your best performance with threads comes when you bind each process to multiple cores. Binding helps performance by ensuring your memory is always local, and provides some optimized scheduling benefits. You can bind to multiple cores by adding the qualifier "pe=N" to your mapping definition, like this:
mpirun --map-by socket:pe=4 ....
The above example will map processes by socket, and bind each process to 4 cores.
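As a rough sketch of how this could look for a threaded run, the script below only builds and prints a launch line, so the command can be checked before it is used. The helper name build_cmd, the application name ./stencil, and the use of OMP_NUM_THREADS are assumptions for illustration; the application in the question uses its own parallel for loops and may read its thread count some other way. The --report-bindings option asks mpirun to print where each process was bound, which is a quick way to confirm that a process really owns N cores.

```shell
#!/bin/sh
# Sketch only: prints an mpirun command line, does not launch anything.
# build_cmd and ./stencil are hypothetical names, not from the thread.

# Usage: build_cmd <threads-per-process> <application>
build_cmd() {
    threads=$1
    app=$2
    # pe=N binds each process to N cores, matching the thread count.
    # -x exports OMP_NUM_THREADS to the launched processes (assumption:
    # the threading runtime honours this variable).
    # --report-bindings prints the binding chosen for each process.
    printf 'mpirun --map-by socket:pe=%s --report-bindings -x OMP_NUM_THREADS=%s %s\n' \
        "$threads" "$threads" "$app"
}

# Example: 4 threads per process, so each process is bound to 4 cores.
build_cmd 4 ./stencil
```

Keeping pe=N equal to the number of threads per process gives each thread its own core while the process stays bound, which avoids the situation in the question where several threads share one core.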