I'm measuring barrier synchronization performance on the v1.5.1 build of OpenMPI. I am currently trying to measure synchronization performance on a single node, with 5 processes. I'm getting pretty weak results as follows:
Testing procedure - initialize the timer at the start of the barrier, stop the timer when the process break from the barrier. Cycle through N number of times and calculate the average.
1 Node 5 processes: 299.38ms
1 Node 7 processes: 513.95ms
1 Node 10 processes: 749.94ms
I am wondering if this is the expected performance on a single nodes. I presume Open MPI automatically uses Shared Memory for barrier synchronization on a single node which I think should be able to provide better performance when running on a single node. Is there a way to determine what transport layer I am using and I would greatly appreciate tips on how can I tune this performance.