My name is not Kenneth, but I won't forgo the opportunity to describe the needs of MY application (Cactus)...
Currently, our CUDA functionality is more efficient, but our OpenCL functionality is more mature. We would like to use hwloc to obtain the following information for GPUs, as we already do for CPUs:
- number of cores
- number of PUs per core ("hardware threads"); both for choosing good numbers of threads, and for deciding how "close" they should be in terms of the memory they access. (Neither OpenMP nor OpenCL distinguishes between multi-core threading and SMT.)
- size of the L1 cache, or of the L2 cache if the L1 cache is small
- cache line size (for array padding)
- cache stride (or associativity) for memory allocation
- fastest core / fastest NUMA node from which a GPU can be accessed
To date, we collect some of this information in a "database" with one entry per system that we are using. This works well for development, but in the end, we need to collect this information automatically.
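
To make this more concrete, here is a minimal sketch of what one entry in such a "database" might look like, and how the cache line size would feed into array padding. The struct, its field names, and both helper functions are hypothetical and only for illustration; this is neither actual Cactus code nor part of the hwloc API, and the padding rule shown (rounding each row up to a whole number of cache lines) is just one simple choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-system record mirroring the list above.
   All field names are made up for this sketch. */
struct hw_info {
    unsigned num_cores;       /* number of cores */
    unsigned pus_per_core;    /* PUs ("hardware threads") per core */
    size_t   cache_size;      /* bytes of L1, or of L2 if L1 is small */
    size_t   cache_line_size; /* bytes per cache line */
    unsigned cache_assoc;     /* cache associativity (number of ways) */
};

/* Total number of hardware threads described by the record. */
static unsigned total_threads(const struct hw_info *info)
{
    return info->num_cores * info->pus_per_core;
}

/* Round a row of n elements (elem_size bytes each) up to a whole
   number of cache lines and return the padded element count.
   If an element is larger than a cache line, n is returned unchanged. */
static size_t padded_row_length(size_t n, size_t elem_size, size_t line_size)
{
    assert(elem_size > 0 && line_size > 0);
    size_t elems_per_line = line_size / elem_size;
    if (elems_per_line == 0)
        return n;
    /* Integer ceiling of n / elems_per_line, then back to elements. */
    return (n + elems_per_line - 1) / elems_per_line * elems_per_line;
}
```

For instance, with a 64-byte cache line and 8-byte elements, a row of 100 elements would be padded to 104. The point of obtaining these values from hwloc would be to fill in such a record at run time, for GPUs as well as CPUs, instead of maintaining it by hand per system.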