Hardware Locality Development Mailing List Archives

Subject: [hwloc-devel] Merging the PCI branch?
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-03-28 17:26:17


On 14/03/2011 05:19, Samuel Thibault wrote:
> I was wondering about merging the I/O branch:
> - people have not expressed what they want so much,
> - but people will probably not until it's exposed more,
> - it's really a useful thing, and works fine in our tests,
> - I'd like to see it out :)
>
> I believe the key issue could have been that applications may not expect
> to have objects with an empty cpuset, but since by default I/O objects
> are not added this is not an issue.
>

I have been preparing the libpci branch for a possible merge. Here are
the API changes, for wider review.

Brice

First, to avoid breaking existing applications, I/O devices are not
added to the topology unless a new topology flag is set. Only lstopo
enables PCI devices by default.
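
To make this concrete, here is a minimal sketch of what an application
would do to opt in. I am assuming the flag ends up being called
HWLOC_TOPOLOGY_FLAG_IO_DEVICES; the final name may still change:

#include <hwloc.h>

int main(void)
{
  hwloc_topology_t topology;

  hwloc_topology_init(&topology);
  /* Opt in to I/O object discovery (assumed flag name, see above);
   * without it the topology looks exactly as before. */
  hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
  hwloc_topology_load(topology);

  /* ... the topology now also contains bridge/PCI/OS device objects ... */

  hwloc_topology_destroy(topology);
  return 0;
}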

We have three new object types:
* PCI devices, with the usual PCI bus IDs and link-speed attributes.
* Bridges, with attributes for both sides; only host->PCI and PCI->PCI
bridges for now.
* OS devices, which tell you which "ethX" interface, "sdX" block device,
"mlx4_0" IB NIC or "dma0chan1" DMA engine channel corresponds to a PCI
device.

As shown in the attached picture, the usual I/O subtree is, from top to
bottom (a small traversal sketch follows the list):
* one or more hostbridge objects are attached to a "normal" object
(machine or node)
* a tree of bridges may sit behind each hostbridge
* PCI devices are attached below the bridges
* some PCI devices contain one or more OS devices.
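
Here is a rough sketch (not part of the patch) of how an application
could walk up from one of these I/O objects to the first "normal"
ancestor, i.e. the object whose cpuset describes where the device is
attached. The helper name is made up for the example:

/* Hypothetical helper: walk up from an I/O object (OS device, PCI
 * device or bridge) to the first ancestor that has a cpuset, which
 * is the "normal" object the device is attached under. */
static hwloc_obj_t get_non_io_ancestor(hwloc_obj_t obj)
{
  hwloc_obj_t parent = obj->parent;
  while (parent && !parent->cpuset)
    parent = parent->parent;
  return parent;
}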

These new objects are special:
* They have no cpusets.
* They may appear at arbitrary places in the topology, with very different
numbers of bridges above them, so we do not associate a "level" or a
"depth" with these new types. If you ever need to enumerate them, use the
new get_next_osdev() or get_next_pcidev() functions (a sketch follows).
This may need a bit more documentation.
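
For example, enumerating everything with the two new functions could look
like this (sketch only; it assumes a topology loaded with the I/O flag
enabled, and only uses what the patch below declares):

hwloc_obj_t obj = NULL;
/* Passing NULL returns the first PCI device, then iterate. */
while ((obj = hwloc_get_next_pcidev(topology, obj)) != NULL)
  printf("PCI %04x:%02x:%02x.%01x vendor %04x device %04x\n",
         obj->attr->pcidev.domain, obj->attr->pcidev.bus,
         obj->attr->pcidev.dev, obj->attr->pcidev.func,
         obj->attr->pcidev.vendor_id, obj->attr->pcidev.device_id);

obj = NULL;
/* Same pattern for OS devices; obj->name is "eth0", "sda", "mlx4_0", ... */
while ((obj = hwloc_get_next_osdev(topology, obj)) != NULL)
  printf("OS device %s (type %d)\n", obj->name, (int) obj->attr->osdev.type);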

libpci is needed to make this work, and only Linux gives you OS devices
for now (we use sysfs to translate between PCI devices and OS devices).

I also added some GPU-related OS devices by looking at DRM objects
(card0 and controlD64 in the attached picture). This only works with
free graphics drivers. Ideally we would have some CUDA or OpenCL device
ID there, but we will likely need specific plugins for that. I don't
know whether the current DRM objects are useful; we can remove them
later if needed.

--- a/include/hwloc.h
+++ b/include/hwloc.h
@@ -191,6 +191,17 @@ typedef enum {
                           * Objects without particular meaning, that can e.g. be
                           * added by the application for its own use.
                           */
+
+  HWLOC_OBJ_BRIDGE,      /**< \brief Bridge.
+                           * Any bridge that connects the host or an I/O bus,
+                           * to another I/O bus.
+                           */
+  HWLOC_OBJ_PCI_DEVICE,  /**< \brief PCI device.
+                           */
+
+  HWLOC_OBJ_OS_DEVICE,   /**< \brief Operating system device.
+                           */
+
   HWLOC_OBJ_MAX /**< \private Sentinel value */
 
     /* ***************************************************************
@@ -226,6 +237,20 @@ enum hwloc_compare_types_e {
     HWLOC_TYPE_UNORDERED = INT_MAX /**< \brief Value returned by hwloc_compare_types when types can not be compared. \hideinitializer */
 };
 
+
+typedef enum hwloc_obj_bridge_type_e {
+  HWLOC_OBJ_BRIDGE_HOST,  /**< \brief Host-side of a bridge, only possible upstream. */
+  HWLOC_OBJ_BRIDGE_PCI    /**< \brief PCI-side of a bridge. */
+} hwloc_obj_bridge_type_t;
+
+typedef enum hwloc_obj_osdev_type_e {
+  HWLOC_OBJ_OSDEV_BLOCK,      /**< \brief Operating system block device. */
+  HWLOC_OBJ_OSDEV_GPU,        /**< \brief Operating system GPU device. */
+  HWLOC_OBJ_OSDEV_NETWORK,    /**< \brief Operating system network device. */
+  HWLOC_OBJ_OSDEV_INFINIBAND, /**< \brief Operating system infiniband device. */
+  HWLOC_OBJ_OSDEV_DMA         /**< \brief Operating system dma device. */
+} hwloc_obj_osdev_type_t;
+
 /** @} */
 

@@ -403,6 +428,34 @@ union hwloc_obj_attr_u {
   struct hwloc_group_attr_s {
     unsigned depth; /**< \brief Depth of group object */
   } group;
+  /** \brief PCI Device specific Object Attributes */
+  struct hwloc_pcidev_attr_u {
+    unsigned short domain;
+    unsigned char bus, dev, func;
+    unsigned short class_id;
+    unsigned short vendor_id, device_id, subvendor_id, subdevice_id;
+    unsigned char revision;
+    float linkspeed; /* in GB/s */
+  } pcidev;
+  /** \brief Bridge specific Object Attributes */
+  struct hwloc_bridge_attr_u {
+    union hwloc_bridge_upstream_attr_u {
+      struct hwloc_pcidev_attr_u pci;
+    } upstream;
+    hwloc_obj_bridge_type_t upstream_type;
+    union hwloc_bridge_downstream_attr_u {
+      struct hwloc_bridge_downstream_pci_attr_u {
+        unsigned short domain;
+        unsigned char secondary_bus, subordinate_bus;
+      } pci;
+    } downstream;
+    hwloc_obj_bridge_type_t downstream_type;
+    unsigned depth;
+  } bridge;
+  /** \brief OS Device specific Object Attributes */
+  struct hwloc_osdev_attr_u {
+    hwloc_obj_osdev_type_t type;
+  } osdev;
 };
 
 /** \brief Distances between objects
 
 /** \brief Restrict the topology to the given CPU set.
@@ -1675,6 +1770,27 @@ HWLOC_DECLSPEC int hwloc_free(hwloc_topology_t topology, void *addr, size_t len)
 /** @} */
 
 
+
+/** \defgroup hwlocality_iodev Basic I/O Device Management
+ * @{
+ */
+
+/** \brief Get the next PCI device in the system.
+ *
+ * \return the first PCI device if \p prev is \c NULL.
+ */
+HWLOC_DECLSPEC struct hwloc_obj * hwloc_get_next_pcidev(struct hwloc_topology *topology, struct hwloc_obj *prev);
+
+/** \brief Get the next OS device in the system.
+ *
+ * \return the first OS device if \p prev is \c NULL.
+ */
+HWLOC_DECLSPEC struct hwloc_obj * hwloc_get_next_osdev(struct hwloc_topology *topology, struct hwloc_obj *prev);
+
+/** @} */
+
+
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
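
Finally, a sketch of how the new bridge attributes above are meant to be
read, assuming obj is a HWLOC_OBJ_BRIDGE object obtained by walking the
topology:

if (obj->attr->bridge.upstream_type == HWLOC_OBJ_BRIDGE_PCI)
  /* The upstream side of a PCI-to-PCI bridge reuses the PCI device attributes. */
  printf("upstream PCI %04x:%02x:%02x.%01x\n",
         obj->attr->bridge.upstream.pci.domain,
         obj->attr->bridge.upstream.pci.bus,
         obj->attr->bridge.upstream.pci.dev,
         obj->attr->bridge.upstream.pci.func);

if (obj->attr->bridge.downstream_type == HWLOC_OBJ_BRIDGE_PCI)
  /* The downstream side only has the secondary/subordinate bus range. */
  printf("downstream buses %02x-%02x\n",
         obj->attr->bridge.downstream.pci.secondary_bus,
         obj->attr->bridge.downstream.pci.subordinate_bus);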



pci.png