On 13/12/2011 12:14, Samuel Thibault wrote:
> Do we merge the cuda branch into 1.4? I didn't do the work directly
> into the trunk because I wasn't sure of what I'd need to add to the
> interface. In the end, the additions are:
> - the "tight" field, which just means whether children are tightly packed,
> such as cores in an nvidia MultiProcessor, i.e. it's mostly a decorative
> attribute for the drawing function.
> - the "MEM" object type, which represents embedded memory, not a NUMA node.
My main problem is that it's hard to know whether this will still look good
in two years, when we have support for AMD APUs, Intel MIC and other
"strange" architectures. Which types should be common to CPUs and these
accelerators? Might be easy to answer for MIC, but much harder for GPUs.
I initially thought you would use PUs for GPU threads, but actually neither
PU nor Core really satisfies me. Core looks too big given the small
abilities of GPU threads. But using PU for GPUs might cause problems
because we can't bind tasks to individual GPU threads.
Also I don't think the GPU caches should be L2 because they are not very
similar to the CPU ones. I don't know how to handle these. If we add a
cache type for instruction/data/unified, there could also be a special type
for embedded caches.
On the technical side:
* We need a --disable-cuda option. Given the libnuma and libpci mess, there's
no way I can believe that always keeping CUDA enabled will work fine in most
cases. The good news is that CUDA is often not installed in /usr, so
I hope configure will not find it automatically in most cases :)
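A hypothetical configure.ac fragment showing the shape such an option could take (the macros are standard autoconf; the header name and conditional name are just examples, not the actual build-system code):

```m4
AC_ARG_ENABLE([cuda],
  [AS_HELP_STRING([--disable-cuda], [do not build CUDA support])],
  [], [enable_cuda=yes])
AS_IF([test "x$enable_cuda" != "xno"],
  [AC_CHECK_HEADERS([cuda_runtime_api.h], [], [enable_cuda=no])])
AM_CONDITIONAL([HWLOC_HAVE_CUDA], [test "x$enable_cuda" = "xyes"])
```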
* I don't like calling some CUDA function without calling init() first; it
could break one day or another. Fortunately I can't find any cudaInit()
function in the API you use (there's only a cuInit() in the other one).
Do we have any doc saying whether the CUDA functions you use actually
require an init() or not?
* About the "tight" attribute, can't you just make a special case when
you're inside a GPU? It's strange to expose this in the API just for
* About decoration, the lstopo output is totally unreadable on machines
with several "big" GPUs. I wonder if we actually need to display all GPU
threads like this, or whether we could just say "16 SM with 32 threads each" instead?
Last feeling: the more I think about PCI support, the more I wonder
whether it will be used for anything but getting nice lstopo outputs.
Inline helpers are already great for many cases, and people mostly just
need locality info, so I wonder if people will actually use PCI devices
as hwloc objects anywhere except in lstopo. The same question arises for GPUs.