All,
If you want to speed up these routines on platforms without __builtin_clz,
there are several ways to implement clz efficiently in portable C.
See Hacker's Delight nlz (number of leading zeros):
http://www.hackersdelight.org/HDcode/nlz.c.txt
Or from my Ph.D. advisor's Magic Algorithms page:
http://aggregate.org/MAGIC/#Leading%20Zero%20Count
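
To give a flavor of what those pages describe, here is a minimal sketch
of a branch-based binary search for the leading zero count, in the
spirit of the Hacker's Delight nlz routines. The name portable_clz32 is
made up for this example and is not an Open MPI function. Unlike
__builtin_clz, whose result is undefined for 0, this sketch returns 32
for a zero input.

```c
#include <stdint.h>

/* Hypothetical helper (not part of Open MPI): count leading zeros of a
 * 32-bit value by binary search. Returns 32 when x == 0. */
static inline int portable_clz32(uint32_t x)
{
    int n = 0;

    if (x == 0) {
        return 32;
    }
    /* Each step checks whether the top 16/8/4/2/1 bits are all zero;
     * if so, count them and shift them out. */
    if (x <= 0x0000FFFFu) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFFu) { n += 8;  x <<= 8;  }
    if (x <= 0x0FFFFFFFu) { n += 4;  x <<= 4;  }
    if (x <= 0x3FFFFFFFu) { n += 2;  x <<= 2;  }
    if (x <= 0x7FFFFFFFu) { n += 1; }
    return n;
}
```

With something like this, the index of the highest set bit of a nonzero
x is 31 - portable_clz32(x).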
And you can directly implement opal_next_poweroftwo()
with this:
http://aggregate.org/MAGIC/#Next%20Largest%20Power%20of%202
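
Roughly, the trick on that page is to "smear" the highest set bit into
every lower position and then add one. Below is a sketch for 32-bit
values; the function names are invented for this example, and which
variant (strictly greater, or inclusive of exact powers of two) matches
opal_next_poweroftwo() should be checked against bit_ops.h rather than
taken from here.

```c
#include <stdint.h>

/* Hypothetical helper: copy the highest set bit into all lower bits,
 * e.g. 0x00000500 becomes 0x000007FF. */
static inline uint32_t smear_right32(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x;
}

/* Smallest power of two strictly greater than x.
 * Wraps around to 0 when x >= 2^31. */
static inline uint32_t next_pow2_exclusive32(uint32_t x)
{
    return smear_right32(x) + 1u;
}

/* Smallest power of two greater than or equal to x (1 for x == 0).
 * Wraps around to 0 when x > 2^31. */
static inline uint32_t next_pow2_inclusive32(uint32_t x)
{
    return (x == 0) ? 1u : smear_right32(x - 1u) + 1u;
}
```

No loop, no branch in the exclusive variant, and no dependence on a
clz instruction.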
The Hacker's Delight web page (and book) is a fun read for a
certain kind of person. :-)
http://www.hackersdelight.org/
On Tue, Oct 11, 2011 at 6:49 PM, <rusraink_at_[hidden]> wrote:
> Author: rusraink
> Date: 2011-10-11 18:49:01 EDT (Tue, 11 Oct 2011)
> New Revision: 25270
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25270
>
> Log:
> - Check, whether the compiler supports __builtin_clz (count leading
> zeroes);
> if so, use it for bit-operations like opal_cube_dim and opal_hibit.
> Implement two versions of power-of-two.
> In case of opal_next_poweroftwo, this reduces the average execution
> time from 83 cycles to 4 cycles (Intel Nehalem, icc, -O2, inlining,
> measured rdtsc, with loop over 2^27 values).
> Numbers for other functions are similar (but of course heavily depend
> on the usage, e.g. opal_hibit() with a start of 4 does not save
> much). The bsr instruction on AMD Opteron is also not as fast.
>
> - Replace various places where the next power-of-two is computed.
>
> Tested on Intel Nehalem Cluster with openib, compilers GNU-4.6.1 and
> Intel-12.0.4 using mpi_testsuite -t "Collective" with 128 processes.
>
>
> Added:
> trunk/test/util/opal_bit_ops.c
> Text files modified:
> trunk/ompi/mca/btl/openib/btl_openib_mca.c | 13 +---
> trunk/ompi/mca/btl/sm/btl_sm.h | 5 -
> trunk/ompi/mca/btl/sm/btl_sm_component.c | 9 +--
> trunk/ompi/mca/btl/wv/btl_wv_mca.c | 13 +---
> trunk/ompi/mca/coll/basic/coll_basic_reduce_scatter.c | 5 +
> trunk/ompi/mca/coll/tuned/coll_tuned_allgather.c | 3
> trunk/ompi/mca/coll/tuned/coll_tuned_allreduce.c | 4 +
> trunk/ompi/mca/coll/tuned/coll_tuned_barrier.c | 5 +
> trunk/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c | 5 +
> trunk/ompi/mca/coll/tuned/coll_tuned_reduce_scatter.c | 5 +
> trunk/ompi/mca/coll/tuned/coll_tuned_topo.c | 3
> trunk/opal/class/opal_hash_table.c | 8 --
> trunk/opal/config/opal_setup_cc.m4 | 20 ++++++
> trunk/opal/util/bit_ops.h | 106 +++++++++++++++++++++++++++++++++++----
> trunk/test/util/Makefile.am | 14 ++++-
> 15 files changed, 158 insertions(+), 60 deletions(-)
>
[snip]
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
timattox_at_[hidden] || tmattox_at_[hidden]
I'm a bright... http://www.the-brights.net/