Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] 1.3.1 and 1.4rc1
From: Samuel Thibault (samuel.thibault_at_[hidden])
Date: 2011-12-13 12:17:37


Brice Goglin, on Tue 13 Dec 2011 14:10:17 +0100, wrote:
> My main problem is that it's hard to know whether this will look good in
> two years when we'll have support for AMD APU, Intel MIC and other
> "strange" architectures. Which types should be common to CPUs and these
> accelerators? Might be easy to answer for MIC,

And even then, MIC cores are not something you can just bind to.

> but much harder for GPUs.

From the programming point of view, it's not so different actually.

> I actually thought you would use PUs for GPU threads. But actually neither
> PU nor Core really satisfies me. Core looks too big given the small
> abilities of GPU threads. But using PU for GPUs might cause problems
> because we can't bind tasks to individual GPU threads.

Just like we can't directly bind tasks to MIC cores, or to the PUs of
another machine in a topology that includes a whole cluster.
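
For reference, here is roughly what binding to a local PU looks like
today (just a minimal sketch, picking an arbitrary PU); a remote PU or a
GPU thread simply wouldn't be a valid target for hwloc_set_cpubind:

    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_obj_t pu;

        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* first PU of the local machine */
        pu = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, 0);
        if (pu)
            /* binding only makes sense for local, OS-visible PUs */
            hwloc_set_cpubind(topology, pu->cpuset, HWLOC_CPUBIND_THREAD);

        hwloc_topology_destroy(topology);
        return 0;
    }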

> Also I don't think the GPU caches should be L2 because they are not very
> similar to the CPU ones.

How so?

> * We need a --disable-cuda.

Oops, the support was already there, I had just forgotten to add the
actual option; now done.

> Given the libnuma or libpci mess, there's no way I can believe that
> always keeping CUDA enabled will work fine in most cases.

What do you think can go wrong?

> * I don't like calling some CUDA function without init() first, it could
> break one day or another. Fortunately I can't find any cudaInit()
> function in the API you use (there's a cuInit() in the other one only).
> Do we have any doc saying whether the CUDA functions you use actually
> require an init() or not?

Quoting the documentation: “There is no explicit initialization
function for the runtime; it initializes the first time a runtime
function is called”.
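
In other words, with the runtime API something like the following sketch
is expected to work without any explicit init call (assuming the runtime
library is present at all):

    #include <stdio.h>
    #include <cuda_runtime_api.h>

    int main(void)
    {
        int count = 0;
        /* no explicit init: the runtime initializes itself lazily
           on the first runtime API call */
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess)
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        else
            printf("%d CUDA device(s)\n", count);
        return 0;
    }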

> * About the "tight" attribute, can't you just make a special case when
> you're inside a GPU?

I don't like that kind of special-casing: in the future we could very
well also have a full-fledged core alongside an MP on the GPU.

> * About decoration, the lstopo output is totally unreadable on machines
> with several "big" GPUs. I wonder if we actually need to display all GPU
> threads like this or just say "16 SM with 32 threads each" instead?

Well, we don't provide such a summary for very big machines like our
96-core machine either...

> Last feeling: The more I think about PCI support, the more I wonder
> whether it will be used for anything but getting nice lstopo outputs.
> Inline helpers are already great for many cases, people just need
> locality info in most cases,

Which they may need to reconnect with actual hwloc objects, e.g. to
redistribute threads among the cores inside the socket, etc. Or to
reconnect with the NUMA distances to check for PCI-memory transfers, be
it for NICs, GPUs, etc.
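
As a rough sketch of that kind of use (assuming one already has the
cpuset close to a device, e.g. from one of the inline helpers; the names
below are only illustrative):

    #include <hwloc.h>

    /* Bind the calling worker thread to the i-th core close to a device.
       'devset' is assumed to be the cpuset near the device, as obtained
       from one of the inline locality helpers; 'i' is the worker index. */
    static int bind_worker_near_device(hwloc_topology_t topology,
                                       hwloc_const_cpuset_t devset, unsigned i)
    {
        int ncores = hwloc_get_nbobjs_inside_cpuset_by_type(topology, devset,
                                                            HWLOC_OBJ_CORE);
        hwloc_obj_t core;

        if (ncores <= 0)
            return -1;
        /* reconnect the raw locality mask with the actual Core objects */
        core = hwloc_get_obj_inside_cpuset_by_type(topology, devset,
                                                   HWLOC_OBJ_CORE, i % ncores);
        return hwloc_set_cpubind(topology, core->cpuset, HWLOC_CPUBIND_THREAD);
    }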

Samuel