Linux virtualization and PCI passthrough

Processors have evolved to improve performance for virtualized environments, but what about I/O aspects? Discover one such I/O performance enhancement called device (or PCI) passthrough. This innovation improves performance of PCI devices using hardware support from Intel (VT-d) or AMD (IOMMU).

Platform virtualization is about sharing a platform among two or more operating systems for more efficient use of resources. But a platform implies more than just a processor: it also includes storage, networking, and the other hardware resources that make up the machine. Some hardware resources can easily be virtualized, such as the processor or storage, but others cannot, such as a video adapter or a serial port. Peripheral Component Interconnect (PCI) passthrough provides the means to use those resources efficiently when sharing is not possible or useful. This article explores the concept of passthrough, discusses its implementation in hypervisors, and details the hypervisors that support this recent innovation.

Platform device emulation

Before we jump into passthrough, let's explore how device emulation works today in two hypervisor architectures. The first architecture incorporates device emulation within the hypervisor, while the second pushes device emulation to a hypervisor-external application.

Device emulation within the hypervisor is a common method implemented within the VMware Workstation product (an operating system-based hypervisor). In this model, the hypervisor includes emulations of common devices that the various guest operating systems can share, including virtual disks, virtual network adapters, and other necessary platform elements. This particular model is shown in Figure 1.

Figure 1. Hypervisor-based device emulation

The second architecture is called user space device emulation (see Figure 2). As the name implies, rather than being embedded within the hypervisor, the device emulation is implemented in user space. QEMU (which provides a hypervisor as well as device emulation) supplies the device emulation used by a large number of independent hypervisors (the Kernel-based Virtual Machine [KVM] and VirtualBox being just two). This model is advantageous because the device emulation is independent of the hypervisor and can therefore be shared between hypervisors. It also permits arbitrary device emulation without burdening the hypervisor (which operates in a privileged state) with this functionality.

Figure 2. User space device emulation

Pushing the device emulation from the hypervisor to user space has some distinct advantages. The most important relates to what's called the trusted computing base (TCB). The TCB of a system is the set of all components that are critical to its security. It stands to reason, then, that if the TCB is minimized, there is a smaller probability of bugs and, therefore, a more secure system. The same idea applies to the hypervisor. The security of the hypervisor is crucial, because it isolates multiple independent guest operating systems. With less code in the hypervisor (the device emulation having been pushed into less privileged user space), there is less chance of leaking privileges to untrusted users.

Another variation on hypervisor-based device emulation is paravirtualized drivers. In this model, the hypervisor includes the physical drivers, and each guest operating system includes a hypervisor-aware driver that works in concert with the hypervisor drivers (called paravirtualized, or PV, drivers).

Regardless of whether the device emulation occurs in the hypervisor or on top of it in a guest virtual machine (VM), the emulation methods are similar. Device emulation can mimic a specific device (such as a Novell NE1000 network adapter) or a specific type of disk (such as an Integrated Drive Electronics [IDE] drive). The physical hardware can differ greatly: for example, while an IDE drive is emulated to the guest operating systems, the physical hardware platform can use a serial ATA (SATA) drive. This is useful because IDE support is common among many operating systems and can serve as a common denominator, rather than requiring every guest operating system to support more advanced drive types.

Device passthrough

As you can see in the two device emulation models discussed above, there's a price to pay for sharing devices. Whether device emulation is performed in the hypervisor or in user space within an independent VM, overhead exists. This overhead is worthwhile as long as the devices need to be shared by multiple guest operating systems. If sharing is not necessary, there are more efficient ways to provide devices to guests.

So, at the highest level, device passthrough is about isolating a device for a given guest operating system so that the device can be used exclusively by that guest (see Figure 3). But why is this useful? Not surprisingly, there are a number of reasons why device passthrough is worthwhile. Two of the most important are performance and providing exclusive use of a device that is not inherently shareable.

Figure 3. Passthrough within the hypervisor

In terms of performance, device passthrough can achieve near-native performance. This is perfect for networking applications (or those with high disk I/O) that have avoided virtualization because of contention and performance degradation through the hypervisor (to a driver in the hypervisor or through the hypervisor to a user space emulation). But assigning devices to specific guests is also useful when those devices cannot be shared. For example, if a system included multiple video adapters, those adapters could be passed through to unique guest domains.

Finally, there may be specialized PCI devices that only one guest domain uses or devices that the hypervisor does not support and therefore should be passed through to the guest. Individual USB ports could be isolated to a given domain, or a serial port (which is itself not shareable) could be isolated to a particular guest.

Underneath the covers of device emulation

Early forms of device emulation implemented shadow forms of device interfaces in the hypervisor to provide the guest operating system with a virtual interface to the hardware. This virtual interface would consist of the expected interface, including a virtual address space representing the device (such as a shadow PCI configuration space) and virtual interrupts. But with a device driver talking to a virtual interface and a hypervisor translating this communication to actual hardware, there's a considerable amount of overhead, particularly for high-bandwidth devices like network adapters.
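
To make the cost of this trap-and-emulate path concrete, the sketch below shows the kind of handler a hypervisor might run every time a guest driver writes a register of an emulated network adapter. It is purely illustrative: the structure, register offsets, and function names are invented for this example and are not taken from any real hypervisor.

Listing 1. A conceptual trap-and-emulate handler for an emulated device

/* Conceptual sketch only: when the guest touches an MMIO register of the
 * emulated NIC, the access traps into the hypervisor, which updates a
 * software model of the device. */

#include <stdint.h>

struct emu_nic {
    uint32_t ctrl;      /* emulated control register          */
    uint32_t status;    /* emulated status register           */
    uint64_t tx_ring;   /* guest-physical address of TX ring  */
};

/* Called by the hypervisor on every trapped MMIO write to the device's BAR. */
static void emu_nic_mmio_write(struct emu_nic *nic, uint64_t offset, uint64_t val)
{
    switch (offset) {
    case 0x00: nic->ctrl = (uint32_t)val; break;  /* guest programs the device      */
    case 0x08: nic->tx_ring = val;        break;  /* guest hands over a ring address */
    default:   /* ignore writes to unimplemented registers */ break;
    }
}

int main(void)
{
    struct emu_nic nic = { 0 };
    /* Every such access costs a VM exit plus this emulation path, which is
     * why high-bandwidth devices suffer the most from full emulation. */
    emu_nic_mmio_write(&nic, 0x00, 0x1);
    return nic.ctrl == 0x1 ? 0 : 1;
}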

Xen popularized the PV approach (discussed in the previous section), which reduced the performance degradation by making the guest operating system driver aware that it was being virtualized. In this case, the guest operating system would not see a PCI space for a device (such as a network adapter) but instead a network adapter application programming interface (API) that provided a higher-level abstraction (such as a packet interface). The downside to this approach was that the guest operating system had to be modified for PV. The upside was that near-native performance could be achieved in some cases.
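
The shape of such a paravirtualized interface can be sketched as a shared ring of packet descriptors, loosely modeled on the split-driver idea that Xen popularized. The layout below is simplified and illustrative (a real implementation needs shared-memory grants, memory barriers, and event notification), but it shows why the guest no longer needs to touch emulated device registers.

Listing 2. A simplified paravirtualized packet ring

#include <stdint.h>

#define RING_SIZE 256

struct pkt_desc {
    uint64_t guest_addr;   /* guest-physical address of the packet buffer */
    uint32_t len;          /* packet length in bytes                      */
};

struct pv_ring {
    volatile uint32_t prod;            /* producer index (guest frontend)  */
    volatile uint32_t cons;            /* consumer index (backend driver)  */
    struct pkt_desc  desc[RING_SIZE];  /* shared packet descriptors        */
};

/* Frontend: post a packet for transmission without touching any emulated
 * registers; the backend picks it up and drives the real hardware. */
static int pv_ring_send(struct pv_ring *r, uint64_t addr, uint32_t len)
{
    uint32_t idx = r->prod;
    if (idx - r->cons == RING_SIZE)
        return -1;                     /* ring is full */
    r->desc[idx % RING_SIZE] = (struct pkt_desc){ .guest_addr = addr, .len = len };
    r->prod = idx + 1;                 /* a real driver adds a memory barrier here */
    return 0;
}

int main(void)
{
    static struct pv_ring ring;
    return pv_ring_send(&ring, 0x100000, 1500) == 0 ? 0 : 1;
}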

Early attempts at device passthrough used a thin emulation model, in which the hypervisor provided software-based memory management (translating guest operating system address space to trusted host address space). And while early attempts provided the means to isolate a device to a particular guest operating system, the approach lacked the performance and scalability required for large virtualization environments. Luckily, processor vendors have equipped next-generation processors with instructions to support hypervisors as well as logic for device passthrough, including interrupt virtualization and direct memory access (DMA) support. So, instead of trapping and emulating access to physical devices below the hypervisor, new processors provide DMA address translation and permission checking for efficient device passthrough.

Hardware support for device passthrough

Both Intel and AMD provide support for device passthrough in their newer processor architectures (in addition to new instructions that assist the hypervisor). Intel calls its option Virtualization Technology for Directed I/O (VT-d), while AMD calls its option the I/O Memory Management Unit (IOMMU). In each case, the new CPUs provide the means to map PCI physical addresses to guest virtual addresses. When this mapping occurs, the hardware takes care of access (and protection), and the guest operating system can use the device as if it were running on a non-virtualized system. In addition to mapping guest memory to physical memory, isolation is provided such that other guests (or the hypervisor) are precluded from accessing it. The Intel and AMD CPUs provide much more virtualization functionality. You can learn more in the Resources section.
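
Conceptually, what VT-d or the IOMMU adds is a translation and permission check on every DMA issued by an assigned device. The sketch below models that check with a flat software table; real hardware uses per-device, multi-level page tables, and all names and addresses here are invented for illustration.

Listing 3. What DMA remapping does, as a conceptual sketch

#include <stdint.h>
#include <stddef.h>

#define IOMMU_READ   0x1
#define IOMMU_WRITE  0x2

struct iommu_entry {
    uint64_t guest_bus_addr;  /* address the guest programmed into the device */
    uint64_t host_phys_addr;  /* real memory behind it                        */
    uint64_t size;
    uint32_t perms;           /* IOMMU_READ | IOMMU_WRITE                     */
};

/* Returns the host-physical address for a device DMA, or 0 to block it. */
static uint64_t iommu_translate(const struct iommu_entry *tbl, size_t n,
                                uint64_t bus_addr, uint32_t access)
{
    for (size_t i = 0; i < n; i++) {
        const struct iommu_entry *e = &tbl[i];
        if (bus_addr >= e->guest_bus_addr &&
            bus_addr <  e->guest_bus_addr + e->size &&
            (e->perms & access) == access)
            return e->host_phys_addr + (bus_addr - e->guest_bus_addr);
    }
    return 0;  /* not mapped for this guest: the DMA is blocked, so isolation holds */
}

int main(void)
{
    struct iommu_entry map[] = {
        { .guest_bus_addr = 0x10000, .host_phys_addr = 0x7f000000,
          .size = 0x4000, .perms = IOMMU_READ | IOMMU_WRITE },
    };
    return iommu_translate(map, 1, 0x10080, IOMMU_WRITE) ? 0 : 1;
}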

Another innovation that helps interrupts scale to large numbers of VMs is called Message Signaled Interrupts (MSI). Rather than relying on physical interrupt pins to be associated with a guest, MSI transforms interrupts into messages that are more easily virtualized (scaling to thousands of individual interrupts). MSI has been available since PCI version 2.2 but is also available in PCI Express (PCIe), where it allows fabrics to scale to many devices. MSI is ideal for I/O virtualization, as it allows isolation of interrupt sources (as opposed to physical pins that must be multiplexed or routed through software).
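
Whether a device supports MSI is visible in its PCI configuration space, which Linux exposes through sysfs. The short program below walks the capability list looking for the MSI capability (ID 0x05); the device address is a placeholder, and reading past the first 64 bytes of config space generally requires root.

Listing 4. Checking a device for MSI support through sysfs

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Placeholder bus:device.function; adjust to a real device on your system. */
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/config";
    uint8_t cfg[256] = { 0 };

    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }
    size_t n = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);
    if (n < 64) { fprintf(stderr, "short read of config space\n"); return 1; }

    if (!(cfg[0x06] & 0x10)) {                 /* status register: capabilities list bit */
        puts("no capability list");
        return 0;
    }

    /* Walk the capability list: byte 0 is the ID, byte 1 points to the next entry. */
    for (uint8_t off = cfg[0x34] & 0xFC; off; off = cfg[off + 1] & 0xFC) {
        if (cfg[off] == 0x05) {                /* 0x05 = MSI capability ID */
            printf("MSI capability found at offset 0x%02x\n", off);
            return 0;
        }
    }
    puts("device does not advertise MSI");
    return 0;
}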

Hypervisor support for device passthrough

Using the latest virtualization-enhanced processor architectures, a number of hypervisors and virtualization solutions support device passthrough. You'll find support for device passthrough (using VT-d or the IOMMU) in Xen and KVM as well as other hypervisors. In most cases, the privileged host kernel (domain 0 in Xen's case) must be compiled to support passthrough, which is available as a kernel build-time option. Hiding the devices from the host may also be required (as is done with Xen using pciback). Some restrictions apply with PCI (for example, PCI devices behind a PCIe-to-PCI bridge must be assigned to the same domain), but PCIe devices do not have this restriction.
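
For KVM, hiding a device from the host typically means detaching it from its host driver and handing it to a stub driver (pci-stub; Xen uses pciback) through the standard sysfs driver-binding files. The sketch below shows that sequence in C; the device address and vendor/device IDs are placeholders, and in practice a short shell script or libvirt's managed mode does the same job.

Listing 5. Detaching a device from its host driver and binding it to pci-stub

#include <stdio.h>

/* Write a single string to a sysfs file; returns 0 on success. */
static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    const char *bdf = "0000:01:00.0";   /* placeholder device address        */
    const char *ids = "8086 10b9";      /* placeholder vendor and device IDs */
    char unbind[128];

    snprintf(unbind, sizeof(unbind), "/sys/bus/pci/devices/%s/driver/unbind", bdf);

    /* Let pci-stub claim this ID, detach the host driver, then bind the stub. */
    write_str("/sys/bus/pci/drivers/pci-stub/new_id", ids);
    write_str(unbind, bdf);
    write_str("/sys/bus/pci/drivers/pci-stub/bind", bdf);
    return 0;
}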

Additionally, you'll find configuration support for device passthrough in libvirt (along with virsh), which provides an abstraction over the configuration schemes used by the underlying hypervisors.
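
As an illustration of that abstraction, the sketch below uses the libvirt C API to attach a host PCI device to a running guest with a <hostdev> element; with managed='yes', libvirt itself detaches the device from its host driver. The domain name and PCI address are placeholders, and the same XML can be fed to virsh attach-device instead.

Listing 6. Attaching a PCI device to a guest through libvirt

#include <stdio.h>
#include <libvirt/libvirt.h>   /* link with -lvirt */

int main(void)
{
    /* The PCI address and the domain name below are placeholders. */
    const char *hostdev_xml =
        "<hostdev mode='subsystem' type='pci' managed='yes'>"
        "  <source>"
        "    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>"
        "  </source>"
        "</hostdev>";

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to the hypervisor\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, "guest1");
    if (dom && virDomainAttachDevice(dom, hostdev_xml) == 0)
        puts("device attached");
    else
        fprintf(stderr, "attach failed\n");

    if (dom)
        virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}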

Problems with device passthrough

One of the problems introduced with device passthrough is live migration. Live migration is the suspension and subsequent migration of a VM to a new physical host, at which point the VM is restarted. This is a great feature for load balancing VMs over a network of physical hosts, but it presents a problem when passthrough devices are used. PCI hotplug (of which there are several specifications) is one aspect that needs to be addressed. PCI hotplug permits PCI devices to come and go from a given kernel, which is ideal, particularly when considering migration of a VM to a hypervisor on a new host machine (devices need to be unplugged and then subsequently plugged in at the new hypervisor). When devices are emulated, such as virtual network adapters, the emulation provides a layer that abstracts away the physical hardware. In this way, a virtual network adapter migrates easily within the VM (an approach also supported by the Linux bonding driver, which allows multiple logical network adapters to be bonded to the same interface). A passthrough device, by contrast, carries state in the physical hardware that cannot simply be copied, so it must be detached before migration and reattached on the new host.
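
To make the hotplug aspect concrete, Linux exposes PCI hotplug operations through sysfs: writing to a device's remove file detaches it from the kernel, and writing to the bus-level rescan file rediscovers devices. The sketch below exercises those two files; the device address is a placeholder, and a hypervisor would normally drive the equivalent unplug inside the guest through ACPI hotplug events rather than from within the guest itself.

Listing 7. Hot-removing and rediscovering a PCI device through sysfs

#include <stdio.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    /* Placeholder device address; both operations require root. */
    write_str("/sys/bus/pci/devices/0000:01:00.0/remove", "1");

    /* ... the device could now be reassigned; a bus rescan brings devices back. */
    write_str("/sys/bus/pci/rescan", "1");
    return 0;
}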

Next steps in I/O virtualization

The next steps in I/O virtualization are actually happening today. For example, PCIe includes support for virtualization. One virtualization concept that's ideal for server virtualization is called Single-Root I/O Virtualization (SR-IOV). This virtualization technology (created by the PCI Special Interest Group, or PCI-SIG) provides device virtualization in single-root complex instances (in this case, a single server with multiple VMs sharing a device). Another variation, called Multi-Root IOV, supports larger topologies (such as blade servers, where multiple servers can access one or more PCIe devices). In a sense, this permits arbitrarily large networks of devices, including servers, end devices, and switches (complete with device discovery and packet routing).

With SR-IOV, a PCIe device can export not just a number of PCI physical functions but also a set of virtual functions that share resources on the I/O device. The simplified architecture for server virtualization is shown in Figure 4. In this model, no passthrough is necessary, because virtualization occurs at the end device, allowing the hypervisor to simply map virtual functions to VMs to achieve native device performance with the security of isolation.

Figure 4. Passthrough with SR-IOV
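
On the host side, creating those virtual functions is a small step on kernels that expose the sriov_numvfs attribute in sysfs (older drivers used module parameters such as max_vfs instead). The sketch below asks a physical function to create four VFs; the PCI address and the VF count are placeholders.

Listing 8. Enabling SR-IOV virtual functions through sysfs

#include <stdio.h>

int main(void)
{
    /* Placeholder physical function; requires root and an SR-IOV-capable device. */
    const char *pf = "0000:03:00.0";
    char path[128];

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/sriov_numvfs", pf);

    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fprintf(f, "4\n");               /* ask the physical function for 4 virtual functions */
    fclose(f);

    /* Each VF now appears as its own PCI function that the hypervisor or
     * libvirt can map to a guest like any other passthrough device. */
    return 0;
}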

Going further

Virtualization has been under development for about 50 years, but only now is there widespread attention on I/O virtualization. Commercial processor support for virtualization has been around for only five years. So, in essence, we're on the cusp of what's to come for platform and I/O virtualization. And as a key element of future architectures like cloud computing, virtualization will certainly be an interesting technology to watch as it evolves. As usual, Linux is at the forefront of support for these new architectures, and recent kernels (2.6.27 and beyond) are beginning to include support for these new virtualization technologies.

Resources

Learn
Get products and technologies
  • With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.
