I have been running the IMB performance tests and noticed some strange
behavior. This is running on a CentOS cluster with 16 processes per
node and using the openib btl. Currently, I am looking at the
MPI_Barrier performance. Since we make use of a recursive doubling
algorithm (in the tuned collective), I would have expected to see
log2(np) scaling. However, the data is much worse than log2(np),
with the trunk being worse than v1.2.4.
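
To make the log2(np) expectation concrete, here is a minimal sketch of
the communication schedule of a recursive doubling barrier. It is a
plain Python simulation, not the Open MPI tuned code: the function
names are made up for illustration, and it assumes a power-of-two
process count (a real implementation needs extra steps otherwise).

```python
def recursive_doubling_rounds(nprocs):
    """Return one list per round; entry [rank] is that rank's partner."""
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    rounds = []
    distance = 1
    while distance < nprocs:
        # In each round, a rank pairs with the rank whose ID differs in one bit.
        rounds.append([rank ^ distance for rank in range(nprocs)])
        distance *= 2
    return rounds


def simulate_barrier(nprocs):
    """Track which ranks each process has (transitively) heard from."""
    known = [{rank} for rank in range(nprocs)]
    for partners in recursive_doubling_rounds(nprocs):
        # Each rank exchanges with its partner (a send/recv pair in MPI terms).
        known = [known[rank] | known[partners[rank]] for rank in range(nprocs)]
    return known


if __name__ == "__main__":
    for n in (2, 16, 64):
        rounds = recursive_doubling_rounds(n)
        # After the last round every rank has heard from every other rank.
        assert all(len(k) == n for k in simulate_barrier(n))
        print(n, "processes:", len(rounds), "rounds")
```

In this model every rank has heard from all others after exactly
log2(np) exchange rounds, which is where the expected scaling comes
from.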

One interesting piece of data is that I replaced the tuned algorithm
with one that is very similar (copied from Sun Clustertools 6), but
instead of each process doing a send/recv, we have each one do a send
to its lower partners followed by a receive from its upper partners.
Then everything is reversed, which finishes the barrier. For reasons
unknown, this appears to perform better even though both algorithms
should be log2(np).

Another interesting fact is that when run on my really slow cluster over
TCP (latency of 150 usec), the tuned barrier algorithm very closely
follows the expected log2(np).
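
As a rough yardstick for what "follows log2(np)" means, the sketch
below computes an idealized barrier time as the number of rounds times
the per-message latency. This is an assumed model, not measured data:
the function name is hypothetical, and it ignores bandwidth,
contention, and the fact that with 16 processes per node some partners
are on the same node.

```python
def ideal_barrier_time_usec(nprocs, latency_usec):
    """Idealized barrier time: ceil(log2(nprocs)) rounds of one latency each."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    # (nprocs - 1).bit_length() equals ceil(log2(nprocs)) for nprocs >= 1.
    rounds = (nprocs - 1).bit_length()
    return rounds * latency_usec


if __name__ == "__main__":
    # 150 usec is the TCP latency quoted above; the process counts are examples.
    for n in (2, 16, 64, 128):
        print(n, "processes:", ideal_barrier_time_usec(n, 150), "usec")
```

Measured times that grow clearly faster than this curve as np doubles
are what I mean by "worse than log2(np)".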

I have mentioned this issue to a few people, but thought I would share
it with a wider audience to see if anyone else has observed MPI_Barrier
performance that is not log2(np). I have attached two PDFs. The first one shows
my results and the second one is a picture of the two different barrier