On Fri, 2010-12-03 at 09:54 +0100, Brice Goglin wrote:
> On 02/12/2010 22:25, Bernd Kallies wrote:
> >> Do you have any feel for if there are particular bottlenecks in hwloc / lstopo that make it take so long? I wonder if we should just attack those (if possible)...? Samuel and Brice have done all the work in the guts of the API, so they might know offhand if there are places that can be optimized or not...
> > Hmm. I did no profiling. The machines in question have 64 NUMA nodes
> > with 16 logical CPUs each. The topology depth is 10. So parsing
> > /sys/devices/system/node/* and evaluating the distance matrix to
> > work out the topology tree should be quite expensive. But I guess this
> > statement is trivial and does not help very much.
> We should really encourage people to use XML in such cases. Setting
> HWLOC_XMLFILE=/path/to/exported/file.xml in the environment should just
> work (as long as you regenerate the XML file across major hwloc or OS releases).
> Maybe we should add a dedicated section about this in the documentation?
> Something like "Speeding up hwloc on large nodes"? And maybe even
> encourage distro packagers to create an XML export file under /var/lib,
> with advice to add HWLOC_XMLFILE to /etc/environment if they care
> about hwloc/HPC?
> Anyway Bernd, can you export a XML on this nice machine and reload it
> and see how long it takes? I hope all the bottlenecks are in the Linux
> backend parsing /sys and /proc, not in the actual hwloc core.
I'm not sure I understood. From my point of view it makes no sense to
create an XML representation of the topology with hwloc, and then read
this XML back in just to dive into it and figure something out. When
there is an API that provides direct access to parts of the topology
once it is constructed, the XML detour is useless.
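(For reference, the export-and-reload path you suggest amounts to something like the following; the file path is only an example:)

```shell
# Export the topology once, e.g. at boot (lstopo ships with hwloc;
# it picks the XML format from the .xml extension):
lstopo /var/lib/hwloc/topology.xml
# Later, make hwloc-based tools load the XML instead of rescanning /sys:
export HWLOC_XMLFILE=/var/lib/hwloc/topology.xml
```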
However, one may prepare a static XML representation of the machine
topology at boot time and store it somewhere for public access, as you
suggested (or as I understood you). But this would not help us in many
of our use cases. We have to analyze topologies that do not represent a
whole machine. We analyze topologies that are bound to cpusets. We do
this e.g. to construct pinning schemes for MPI applications that run
inside batch jobs, whose cpusets get created on the fly depending on
their resource requests and the current load of the machine.

You will find current implementations of this strategy of calculating
pinning schemes from hwloc topology information e.g. in recent
MVAPICH2. I cannot see any advantage of XML representations for this
purpose. The question here is rather whether every task running on a
node should read the topology and figure out on which CPU to pin
itself, or whether one master task per node should do this and
communicate the result to the others. But this is outside of hwloc.
> By the way, we're not the only project with little scalability problems
> on very large machines: https://lkml.org/lkml/2010/12/3/19 :)
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin