Open MPI logo

Portable Hardware Locality (hwloc) Documentation: v2.0.3

  |   Home   |   Support   |   FAQ   |  
I/O Devices

hwloc usually manipulates processing units and memory but it can also discover I/O devices and report their locality as well. This is useful for placing I/O intensive applications on cores near the I/O devices they use, or for gathering information about all platform components.

Enabling and requirements

I/O discovery is disabled by default (except in lstopo) for performance reasons. It can be enabled by changing the filtering of I/O object types to HWLOC_TYPE_FILTER_KEEP_IMPORTANT or HWLOC_TYPE_FILTER_KEEP_ALL before loading the topology, for instance with hwloc_topology_set_io_types_filter().

Note that I/O discovery requires significant help from the operating system. The pciaccess library (the development package is usually libpciaccess-devel or libpciaccess-dev) is needed to fully detect PCI devices and bridges. On Linux, PCI discovery may still be performed even if libpciaccess cannot be used. But it misses PCI device names. Moreover, some operating systems require privileges for probing PCI devices, see Does hwloc require privileged access? for details.

The actual locality of I/O devices is only currently detected on Linux. Other operating system will just report I/O devices as being attached to the topology root object.

I/O objects

When I/O discovery is enabled and supported, some additional objects are added to the topology. The corresponding I/O object types are:

Any of these types may be filtered individually with hwloc_topology_set_type_filter().

hwloc tries to attach these new objects to normal objects (usually NUMA nodes) to match their actual physical location. For instance, if a I/O Hub is physically connected to a package, the corresponding hwloc bridge object (and its PCI bridges and devices children) is inserted as a child of the corresponding hwloc Package object. These children are not in the normal children list but rather in the I/O-specific children list.

I/O objects also have neither CPU sets nor node sets (NULL pointers) because they are not directly usable by the user applications for binding. Moreover I/O hierarchies may be highly complex (asymmetric trees of bridges). So I/O objects are placed in specific levels with custom depths. Their lists may still be traversed with regular helpers such as hwloc_get_next_obj_by_type(). However, hwloc offers some dedicated helpers such as hwloc_get_next_pcidev() and hwloc_get_next_osdev() for convenience (see Finding I/O objects).

OS devices

Although each PCI device is uniquely identified by its bus ID (e.g. 0000:01:02.3), a user-space application can hardly find out which PCI device it is actually using. Applications rather use software handles (such as the eth0 network interface, the sda hard drive, or the mlx4_0 OpenFabrics HCA). Therefore hwloc tries to add software devices (HWLOC_OBJ_OS_DEVICE, also known as OS devices).

OS devices may be attached below PCI devices, but they may also be attached directly to normal objects. Indeed some OS devices are not related to PCI. For instance, NVDIMM block devices (such as pmem0s on Linux) are directly attached near their NUMA node (I/O child of the parent whose memory child is the NUMA node). Also, if hwloc could not discover PCI for some reason, PCI-related OS devices may also be attached directly to normal objects.

hwloc first tries to discover OS devices from the operating system, e.g. eth0, sda or mlx4_0. However, this ability is currently only available on Linux for some classes of devices.

hwloc then tries to discover software devices through additional I/O components using external libraries. For instance proprietary graphics drivers do not expose any named OS device, but hwloc may still create one OS object per software handle when supported. For instance the opencl and cuda components may add some opencl0d0 and cuda0 OS device objects.

Here is a list of OS device objects commonly created by hwloc components when I/O discovery is enabled and supported.

  • Hard disks (HWLOC_OBJ_OSDEV_BLOCK)
    • sda (Linux component)
  • Network interfaces (HWLOC_OBJ_OSDEV_NETWORK)
    • eth0, wlan0, ib0 (Linux component)
  • OpenFabrics (InfiniBand, Omni-Path, usNIC, etc) HCAs (HWLOC_OBJ_OSDEV_OPENFABRICS)
    • mlx5_0, hfi1_0, qib0, usnic_0 (Linux component)
  • GPUs (HWLOC_OBJ_OSDEV_GPU)
    • nvml0 for the first NVML device (NVML component, using the NVIDIA Management Library)
    • :0.0 for the first display (GL component, using the NV-CONTROL X extension library, NVCtrl)
  • Co-Processors (HWLOC_OBJ_OSDEV_COPROC)
    • opencl0d0 for the first device of the first OpenCL platform, opencl1d3 for the fourth device of the second OpenCL platform (OpenCL component)
    • cuda0 for the first NVIDIA CUDA device (CUDA component, using the NVIDIA CUDA Library)
    • mic0 for the first Intel Xeon Phi (MIC) coprocessor (Linux component)
  • DMA engine channel (HWLOC_OBJ_OSDEV_DMA)

Note that some PCI devices may contain multiple software devices (see the example below).

See also Interoperability With Other Software for managing these devices without considering them as hwloc objects.

PCI devices and bridges

A PCI hierarchy is usually organized as follows: A hostbridge object ( HWLOC_OBJ_BRIDGE object with upstream type Host and downstream type PCI) is attached below a normal object (usually the entire machine or a NUMA node). There may be multiple hostbridges in the machine, attached to different places, but all PCI devices are below one of them (unless the Bridge object type is filtered-out).

Each hostbridge contains one or several children, either other bridges (usually PCI to PCI) or PCI devices (HWLOC_OBJ_PCI_DEVICE). The number of bridges between the hostbridge and a PCI device depends on the machine.

Consulting I/O devices and binding

I/O devices may be consulted by traversing the topology manually (with usual routines such as hwloc_get_obj_by_type()) or by using dedicated helpers (such as hwloc_get_pcidev_by_busid(), see Finding I/O objects).

I/O objects do not actually contain any locality information because their CPU sets and node sets are NULL. Their locality must be retrieved by walking up the object tree (through the parent link) until an non-I/O object is found (see hwloc_get_non_io_ancestor_obj()). This normal object should have non-NULL CPU sets and node sets which describe the processing units and memory that are immediately close to the I/O device. For instance the path from a OS device to its locality may go across a PCI device parent, one or several bridges, up to a Package node with the same locality.

Command-line tools are also aware of I/O devices. lstopo displays the interesting ones by default (passing --no-io disables it).

hwloc-calc and hwloc-bind may manipulate I/O devices specified by PCI bus ID or by OS device name.

  • pci=0000:02:03.0 is replaced by the set of CPUs that are close to the PCI device whose bus ID is given.
  • os=eth0 is replaced by CPUs that are close to the I/O device whose software handle is called eth0.

This enables easy binding of I/O-intensive applications near the device they use.

Examples

The following picture shows a dual-package dual-core host whose PCI bus is connected to the first package and NUMA node.

devel09-pci.png

Six interesting PCI devices were discovered. However, hwloc found some corresponding software devices (eth0, eth1, sda, mlx4_0, ib0, and ib1) for only four of these physical devices. The other ones (PCI 102b:0532 and PCI 8086:3a20) are an unused IDE controller (no disk attached) and a graphic card (no corresponding software device reported to the user by the operating system).

On the contrary, it should be noted that three different software devices were found for the last PCI device (PCI 15b3:634a). Indeed this OpenFabrics HCA PCI device object contains one one OpenFabrics software device (mlx4_0) and two virtual network interface software devices (ib0 and ib1).

Here is the corresponding textual output:

Machine (24GB total)
  Package L#0
    NUMANode L#0 (P#0 12GB)
    L3 L#0 (8192KB)
      L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
    HostBridge
      PCIBridge
        PCI 01:00.0 (Ethernet)
          Net "eth0"
        PCI 01:00.1 (Ethernet)
          Net "eth1"
      PCIBridge
        PCI 03:00.0 (RAID)
          Block "sda"
      PCIBridge
        PCI 04:03.0 (VGA)
      PCI 00:1f.2 (IDE)
      PCI 51:00.0 (InfiniBand)
        Net "ib0"
        Net "ib1"
        Net "mlx4_0"
  Package L#1
    NUMANode L#1 (P#1 12GB)
    L3 L#1 (8192KB)
      L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#1)
      L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)