We're integrating hwloc deeper into Open MPI, and are also using it in some other projects. As a result, I have some practical feedback.
(SIDENOTE: reading this letter after I wrote it, I see that it sounds overwhelmingly negative. Please do not take it that way! We love the use of hwloc in our projects -- we just have some very specific feedback on things that we would like changed. We're just not noting all the GOOD things about hwloc in this email. I'll send pink hwloc ponies and rainbows in a different email)
1. The depth-based accessors are Bad. Given the warning language in the docs, paired with the practical reality that some people really do mix and match CPUs in a single server (especially when testing new chips), the depth-based accessors *can/will* fail. Meaning: you have to write application code that handles the non-uniform-depth cases anyway, which makes the depth-based accessors essentially useless.
Note that this also affects the non-depth-based accessors, because many (all?) of them just turn around and call the depth-based accessors. :-\
In every project where we've used hwloc, we've ended up writing tree-traversal-based accessors (e.g., DFS-style search algorithms) for things like finding the Nth object of a given type, for the reasons listed above.
We suggest adding DFS-based accessors and possibly deprecating the depth-based ones. True, the depth-based accessors can find a given object faster, but even with a really large hwloc tree (e.g., 1,000 PUs), does the speed difference really matter? I wonder whether people put hwloc searches in performance-critical code paths where such a cost difference would matter -- we certainly don't. FWIW: we've generally used hwloc during setup phases.
2. All caches are listed as HWLOC_OBJ_CACHE, regardless of their level. We would like to request changing these to having specific enums for each level of cache -- perhaps adding HWLOC_OBJ_CACHE_L1 through L10 to cover future possible platforms.
The reasons we are asking for this are as follows:
(2a) the depth-based accessors are automatically broken for any machine with more than one level of cache (i.e., they return -1 because caches exist at multiple levels). Yes, #1 expounded on how the depth-based accessors are bad, but I mention this point anyway. :-)
(2b) by the same logic, calling get_nbobjs() on HWLOC_OBJ_CACHE fails.
(2c) to find the set of any given Lx caches, you basically have to traverse the tree looking for HWLOC_OBJ_CACHE *and* attr->cache.depth==x. It would be cleaner if we could just look for HWLOC_OBJ_CACHE_L<whatever>.
(2d) more specifically: since all caches are of type HWLOC_OBJ_CACHE, we find ourselves putting in special case logic for caches all over our code. Ick.
Note: I'm not sure how to add new HWLOC_OBJ_CACHE_Lx types and preserve backwards compatibility. :-\
3. It would be really great to have some kind of flag in each object that says whether all of its children are homogeneous or not.
Specifically: if the flag is true, it means that the trees rooted by obj->children[i] are "the same", meaning that each contain the same number of same-typed objects in the same topology layout, and have the same attributes (e.g., their memory sizes are the same, etc.).
Of course, the OS indexes and cpusets will be different between the objects in the different trees. The homogeneous flag does not apply to those kinds of things.
But with this flag, you could traverse just one obj->children subtree, prune all the other DFS searches, and extrapolate the discovered results to the sibling subtrees.
We ended up implementing this kind of feature in a struct hanging off obj->userdata; it saved extra compute cycles and some extra logic in some cases.
4. src/topology-synthetic.c emits error messages on stderr when you try to import an invalid synthetic topology description. I'm guessing this was done because a stderr message can be much more specific than simply returning, for example, EINVAL -- it can tell you exactly where in the description the problem is.

Could this be done in a different way? I ask because test suites that import synthetic topologies into hwloc do not want their stdout/stderr polluted.
Here are two suggestions from Andreas Kupries (tcl-hwloc guy) for this issue:
(4a) One possibility would be a general error callback the user can configure and which is invoked with such messages. An application can then set up a callback doing whatever processing is proper for it. Or NULL to disable.
(4b) A second possibility would be to save the message internally somewhere (it can't be stored in the topology, unfortunately) and provide an API to read it from there. Then, when an application gets an error, it can use that API to retrieve the more detailed message, if any, and process it as it sees fit. This has thread-safety implications, however.
5. The XML dump of the topology doesn't include all the support information, such as whether you can bind to threads/cores/etc. I'm guessing this was done because the emphasis on importing XML was for drawing pretty lstopo pictures.
But we're using the XML export in OMPI to send the topology of compute nodes up to the scheduler, where decisions are made about how to lay out processes on the back-end compute nodes, what the binding width will be, etc. This front-end scheduler needs to know whether the back-end node is capable of supporting binding, for example.
We manually added this information into the message that we send up to the scheduler, but it would be much nicer if the XML export/import just handled that automatically.