Resent to the group.
From: Kenneth A. Lloyd [mailto:kenneth.lloyd_at_[hidden]]
Sent: Tuesday, January 22, 2013 6:05 AM
To: 'Brice Goglin'
Subject: RE: [hwloc-users] hwloc tutorial material
Here's the primary issue (at least my primary issue):
There is a structure to computational problem spaces, on a continuum from
regular to amorphous. The data has a size and structure, and so does the
program execution graph (a network graph that is not always the same,
because it is conditioned by the data). Given these conditions, how does one
look at the compute capability of an existing cluster, and configure the
compute fabric (shmem distribution, and associated affinities across various
devices) to address the problem effectively and efficiently? Our current
solution tends toward this: the programmer hard-codes the solution, or the
user applies heuristics to make those determinations.
We have cast this as a CUDA problem, but it is more universal than that: it
applies to other MPP languages (as you mentioned), Xeon Phi, other GPUs, and
FPGAs. In a heterogeneous cluster, the asymmetries may complicate the
solution (as may nodes being down, or checkpoint/restart schedules). Of
course it is incumbent on the hardware to report information about its
capability (beyond the scope of hwloc).
Sure, we can poll the nodes and use cudaGetDeviceProperties to build up
potential graphs (put them in an XML DOM or other data structure), but even
there, we generally (still) have to use associated lookup tables (IMO, a
cheesy option in this day and age, but I digress).
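To illustrate the lookup-table point: cudaGetDeviceProperties reports the
compute capability (major/minor) and the multiprocessor count, but not the
number of cores per multiprocessor, so that figure usually comes from a
hard-coded table keyed on compute capability. A minimal sketch in plain C,
with no CUDA dependency; the names sm_cores_entry, cores_per_sm and
total_cores are hypothetical, and the table covers only a few architectures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: CUDA cores per multiprocessor for one
 * compute capability (major.minor). */
struct sm_cores_entry {
    int major;
    int minor;
    int cores;
};

/* Deliberately incomplete: newer architectures need new rows, which is
 * exactly the maintenance burden of the lookup-table approach. */
static const struct sm_cores_entry sm_cores_table[] = {
    {1, 0, 8},   {1, 1, 8},  {1, 2, 8}, {1, 3, 8},  /* Tesla  */
    {2, 0, 32},  {2, 1, 48},                        /* Fermi  */
    {3, 0, 192}, {3, 5, 192},                       /* Kepler */
};

/* Returns the cores per multiprocessor, or -1 if the compute
 * capability is not in the table. */
int cores_per_sm(int major, int minor)
{
    size_t n = sizeof(sm_cores_table) / sizeof(sm_cores_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (sm_cores_table[i].major == major &&
            sm_cores_table[i].minor == minor)
            return sm_cores_table[i].cores;
    }
    return -1;
}

/* Total core count for a device, given the multiprocessor count that
 * CUDA reports; -1 if the compute capability is unknown. */
int total_cores(int major, int minor, int sm_count)
{
    assert(sm_count >= 0);
    int per_sm = cores_per_sm(major, minor);
    return per_sm < 0 ? -1 : per_sm * sm_count;
}
```

In real use, the major, minor and sm_count arguments would be taken from the
major, minor and multiProcessorCount fields of the cudaDeviceProp structure.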
I think I understand the general direction for HPC computation using Open MPI
w/ hwloc. Perhaps a more flexible MPI using MPI_Dist_graph_create (missing at
present) is warranted?
I'll get off my stump now.
From: Brice Goglin [mailto:Brice.Goglin_at_[hidden]]
Sent: Tuesday, January 22, 2013 5:15 AM
To: Kenneth A. Lloyd
Cc: Hardware locality user list
Subject: Re: [hwloc-users] hwloc tutorial material
On 22/01/2013 10:27, Samuel Thibault wrote:
> Kenneth A. Lloyd, on Mon 21 Jan 2013 22:46:37 +0100, wrote:
>> Thanks for making this tutorial available. Using hwloc 1.7, how far
>> down into, say, NVIDIA cards can the architecture be reflected?
>> Global memory size? SMX cores? None of the above?
> None of the above for now. Both are available in the cuda svn branch,
Now the question to Kenneth is "what do YOU need?"
I didn't merge the GPU internals into the trunk yet because I'd like to see
if that matches what we would do with OpenCL and other accelerators such as
the Xeon Phi.
One thing to keep in mind is that most hwloc/GPU users will use hwloc to get
locality information, but they will still use CUDA to drive the GPU. So
they will still be able to use CUDA to get in-depth GPU information anyway.
Then the question is how much CUDA info we want to duplicate in hwloc.
hwloc could have the basic/uniform GPU information and let users rely on
CUDA for everything CUDA-specific, for instance. Right now, the basic/uniform
part is almost empty (it just contains the GPU model name or so).
Also, the CUDA branch creates hwloc objects inside the GPU to describe the
memory/cores/caches/... Would you use these objects in your application? Or
would you rather just have a basic GPU attribute structure containing the
number of SMX, the memory size, ...? One problem with this is that it may be
hard to define a structure that works for all GPUs, even only the NVIDIA
ones. We may need a union of structs...
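As a rough illustration of the "union of structs" idea, here is a minimal C
sketch. None of these names exist in hwloc: gpu_attr, gpu_kind,
gpu_compute_blocks and all the fields are hypothetical, chosen only to show a
small uniform part plus a tagged union for the backend-specific part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag saying which union member is valid. */
enum gpu_kind { GPU_KIND_UNKNOWN = 0, GPU_KIND_CUDA, GPU_KIND_OPENCL };

struct gpu_attr {
    /* Basic/uniform part, meaningful for every GPU. */
    const char *model_name;
    size_t global_mem_bytes;

    /* Backend-specific part; only the member matching 'kind' is valid. */
    enum gpu_kind kind;
    union {
        struct {
            unsigned sm_count;               /* number of SMX */
            unsigned cores_per_sm;
            size_t shared_mem_per_sm_bytes;
        } cuda;
        struct {
            unsigned compute_units;
        } opencl;
    } u;
};

/* Number of top-level compute blocks regardless of backend;
 * returns 0 when the kind is unknown. */
unsigned gpu_compute_blocks(const struct gpu_attr *g)
{
    assert(g != NULL);
    switch (g->kind) {
    case GPU_KIND_CUDA:
        return g->u.cuda.sm_count;
    case GPU_KIND_OPENCL:
        return g->u.opencl.compute_units;
    default:
        return 0;
    }
}
```

The trade-off is visible even in this toy version: every new GPU family adds
a union member, and callers must check the tag before touching it.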
I am talking about "your application" above because having lstopo draw very
nice GPU internals doesn't mean the corresponding hwloc objects are useful
to real applications.