Thank you, Jeff and Ganesh.
My current research involves rewriting some MPI collective
operations to work with our system. Barrier is my first step; I may
add bcast and reduce in the future. I understand that some
applications use many unnecessary barriers. But what I want here is
just an application with which to measure the performance improvement
over the normal MPI_Barrier, and that improvement can only be
measured if the barriers are executed many times. I have done some
synthetic tests; all I need now are real applications.
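
As a rough illustration of why the barrier has to run many times, here
is a minimal sketch of a timing loop. A single barrier call is too
short to time reliably, so the loop times many calls and divides by the
iteration count. It uses Python threads and threading.Barrier as a
stand-in for MPI ranks and MPI_Barrier, an assumption made only so the
example runs without an MPI installation. The name time_barrier and its
parameters are hypothetical, and this is not the synthetic test
described above.

```python
import threading
import time


def time_barrier(num_workers: int, iterations: int) -> float:
    """Return the mean seconds per barrier call, as seen by worker 0.

    Threads stand in for MPI ranks here (an assumption for illustration).
    """
    barrier = threading.Barrier(num_workers)
    result = {}

    def worker(rank: int) -> None:
        # One untimed barrier so all workers start the timed loop together.
        barrier.wait()
        start = time.perf_counter()
        for _ in range(iterations):
            barrier.wait()
        elapsed = time.perf_counter() - start
        if rank == 0:
            # Amortize the total time over all iterations.
            result["mean"] = elapsed / iterations

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["mean"]


if __name__ == "__main__":
    mean = time_barrier(num_workers=4, iterations=1000)
    print(f"mean time per barrier: {mean * 1e6:.2f} us")
```

With an actual MPI program the same pattern would apply: time the
whole loop of barrier calls and divide by the number of iterations,
once for the normal MPI_Barrier and once for the replacement.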