Introduction to Compute Virtualization

Cloud computing 1.0 focused on virtualization, which has since become the foundation of cloud computing. Virtualization separates applications from the underlying operating systems (OSs) and hardware of traditional IT systems, and cloud computing relies on that separation. While virtualization and cloud computing can both be used to build highly available and reliable runtime environments for applications, they differ in many ways.

The most apparent difference between the two is that virtualization provides only the Infrastructure as a Service (IaaS) model, while cloud computing provides other service models on top of IaaS. Virtualization is the foundation of cloud computing, so it is important to understand this technology before delving into cloud technologies.

Virtualization allows OSs and applications to run on virtual machines (VMs). This chapter describes how to create VMs and how virtualization works.

Virtualization Overview

What's Virtualization?

Virtualization is a technology that simulates hardware functionalities and creates multiple VMs on a physical server. Generally, applications need to run inside an OS, and only one OS can run on a physical server at a time. Virtualization allows VMs that reside on the same physical server to run independent OSs. This way, multiple OSs can concurrently run on the same physical server.

The essence of virtualization is to separate software from hardware by converting "physical" devices into "logical" folders or files.

Before virtualization, we could locate the physical devices running our applications in a real-world environment using an equipment list or physical configuration list. For example, we can identify the components of a physical server, such as its CPUs, memory, hard disks, and network adapters.
Physical server components

Virtualization converts physical servers into logical folders or files. These folders or files can be divided into two parts: those that store VM configuration information and those that store user data. For example, the configuration file of a KVM VM records the VM name and its CPU, memory, hard disk, and network settings.
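As an illustrative sketch (assuming the libvirt/QEMU toolchain is installed and using a hypothetical VM named vm01), dumping a KVM VM's definition through the libvirt Python binding shows exactly this kind of logical file, with elements for the name, vCPUs, memory, disks, and network interfaces:

    # Sketch only: assumes libvirt-python is installed and a local KVM host is running;
    # the VM name "vm01" is hypothetical.
    import libvirt

    conn = libvirt.open("qemu:///system")   # connect to the local KVM hypervisor
    dom = conn.lookupByName("vm01")         # look up the VM by name
    print(dom.XMLDesc())                    # domain XML: <name>, <vcpu>, <memory>, <disk>, <interface>
    conn.close()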

Without virtualization, running multiple primary applications in the same OS on a physical server may cause runtime conflicts and performance bottlenecks. Running only one application per dedicated server avoids these problems but easily leads to low resource utilization.

With virtualization, multiple VMs can run on a single physical server, and each VM can run an independent OS. This improves resource utilization. In addition, virtualization frees applications from being shackled to a single server by allowing dynamic VM migration across a cluster without impacting service continuity or user experience. Dynamic VM migration enables advanced features like high availability (HA), dynamic resource scheduling (DRS), and distributed power management (DPM), and brings a range of benefits for enterprise data centers, such as portability of workloads, server consolidation, fault tolerance, and lower OPEX and management costs. 

A Brief History of Compute Virtualization

Virtualizing a physical server into multiple VMs is not a new technology in the IT industry. In fact, IBM ("Big Blue") started virtualizing mainframe computers as early as the 1960s. In 1961, the IBM 709 implemented a time-sharing system: CPU time is divided into very short (1/100-second) time slices, each of which performs a different task. By polling these time slices, a single CPU can be disguised as multiple virtual CPUs, each of which appears to run concurrently. This was the prototype of the VM. Later System/360 machines all supported time-sharing.

  • In 1972, IBM formally named the time-sharing system (CTSS) of the System/370 a VM.
  • In 1990, IBM introduced the System/390, which supports logical partitioning: a physical CPU can be logically divided into several parts (up to ten), each of which works independently. It was not until IBM open-sourced its time-sharing system that x86 virtualization on personal computers began to take shape.
  • In 1999, VMware introduced the first x86 virtualization product. While VMware was developing its own virtualization products, Ian Pratt and Keir Fraser of the University of Cambridge developed the Xen VM in a research project called XenoServer in the 1990s. As the core of XenoServer, Xen manages and allocates system resources and provides the necessary statistics functions. In those days, x86 processors had no hardware support for virtualization, so Xen was built on paravirtualization. To support the running of multiple VMs, Xen required a modified guest kernel. Xen was officially open-sourced in 2002 so that a global community of developers could contribute to and improve the product. Xen 1.0 was officially released, followed a short time later by Xen 2.0. Widespread adoption of the Xen hypervisor took place when Red Hat, Novell, and Sun all added it as their virtualization solution of choice.
  • In 2004, Intel's engineers began working with Xen on hardware virtualization to provide software support for their next-generation processors.
  • In 2005, Xen 3.0 was officially released. It supports Intel's VT technology and the IA-64 architecture, allowing Xen to run unmodified guest OSs. In addition to Xen, Kernel-based Virtual Machine (KVM) is another famous virtualization technology, originally developed by the Israeli startup Qumranet, which used KVM as the VM layer of its Virtual Desktop Infrastructure (VDI) product. To simplify development, the KVM developers did not write a new hypervisor from scratch. Instead, they added a loadable module to the Linux kernel, turning the kernel itself into a hypervisor. For details about hypervisors, see "Compute Virtualization Types."
  • In October 2006, after completing the basic functions, dynamic migration, and optimization of main functions and performance, Qumranet officially announced the birth of KVM. Also in October 2006, the source code of the KVM module was officially accepted into the Linux kernel and became part of the kernel source code. On September 4, 2008, Qumranet was acquired by Red Hat, Inc. for $107 million in cash. Red Hat is a famous Linux distribution vendor and a leading contributor to the kernel community, and the acquisition made Red Hat the new owner of the KVM open-source project. After the acquisition, Red Hat developed its own VM solution and began to replace Xen with KVM in its products. In November 2010, Red Hat launched Red Hat Enterprise Linux 6 (RHEL 6), which integrated the latest KVM and removed the Xen hypervisor that had been integrated in the RHEL 5.x series. From 2006 to 2010, traditional IT vendors also launched their own virtualization products.
  • In 2007, HP launched Integrity VMs, and Microsoft added Hyper-V to Windows Server 2008. As x86 virtualization became more and more popular, a lightweight virtualization technology was also being developed: container technology. The concept of containers dates back to 1979 with UNIX chroot.
  • In 2008, after years of development, LXC (LinuX Containers) was released. It is the first and most complete implementation of a Linux container manager, implemented using cgroups and Linux namespaces. LXC is delivered through the liblxc library and provides language bindings for its API in Python 3, Python 2, Lua, Go, Ruby, and Haskell. In contrast to other container technologies, LXC works on a vanilla Linux kernel without requiring any patches. Today, the LXC project is sponsored by Canonical Ltd.
  • In 2013, the Docker container project was launched. Docker also used LXC in its initial stages and later replaced LXC with its own library, libcontainer. Unlike earlier container platforms, Docker introduced an entire ecosystem for managing containers, including a highly efficient layered container image model, global and local container registries, a clean REST API, and a CLI. At a later stage, Docker also started an initiative to implement a container cluster management solution called Docker Swarm.
  • In 2014, CoreOS launched Rocket, an initiative much like Docker, started to fix some of the drawbacks CoreOS found in Docker. CoreOS has stated that its aim is to meet more rigorous security and production requirements than Docker. More importantly, Rocket is implemented on the App Container specification, making it a more open standard.

Compute Virtualization Types

Before introducing compute virtualization types, let's learn some terms that are commonly used in virtualization.

First, a host machine is a physical computer that can run multiple VMs, and the OS installed and running on the host machine is the host OS. VMs running on a host machine are called guest machines, and the OS installed on a VM is called a guest OS. The core virtualization layer between the host OS and the guest OSs is the hypervisor, which is sometimes called the Virtual Machine Manager (VMM).

In a physical architecture, a host has only two layers from bottom to top: hardware (host machine) and host OS. Applications are installed in the host OS. In a virtualization architecture, a host has more layers from bottom to top: hardware (host machine), hypervisor, guest machine, and guest OS. Applications are installed in the guest OS. Multiple guest machines can be created and run on a single host machine.

There are two types of hypervisors: Type 1 and Type 2. Many people categorize containers as the third type of hypervisor. Since containers are not discussed in this course, we will just focus on Type 1 and 2 hypervisors.

A Type 1 hypervisor is also called a bare-metal hypervisor. This type of hypervisor has direct access to hardware resources and does not need to go through a host OS. The hypervisor can be seen as a customized host OS that merely functions as a VMM and does not run other applications. The hypervisor provides the following basic functions: it identifies, captures, and responds to privileged or protection CPU instructions sent by VMs (privileged and protection instructions are described in the CPU virtualization section), schedules the VM queues, and returns the physical hardware processing results to the corresponding VMs. In other words, the hypervisor manages all resources and virtual environments. The VMM can be seen as a complete OS born for virtualization, controlling all resources (CPUs, memory, and I/O devices). The VMM also provisions VMs for running guest OSs, so it must support the creation and management of virtual environments. Virtualization products that use Type 1 hypervisors include VMware ESX Server, Citrix XenServer, and FusionCompute.
 
Type 1 hypervisors have the following advantages and disadvantages:
  • Advantages: VMs can run different types of guest OSs and applications independent of the host OS.
  • Disadvantages: The kernel of the virtualization layer is hard to develop.
A Type 2 hypervisor is also called a hosted hypervisor. Physical resources are managed by the host OS (for example, Windows or Linux). VMM provides virtualization services and functions as a common application in the underlying OS (for example, Windows or Linux). VMs can be created using VMM to share underlying server resources. VMM obtains resources by calling the host OS services to virtualize the CPUs, memory, and I/O devices. After a VM is created, VMM usually schedules the VM as a process of the host OS. The virtualization products that use Type 2 hypervisors include VMware Workstation and Virtual PC.
Type 2 hypervisors have the following advantages and disadvantages:
  • Advantages: They are easy to implement.
  • Disadvantages: Only the applications supported by the host OS can be installed and used. The performance overheads are high.
Unlike a Type 1 hypervisor, a Type 2 hypervisor is only a program in the host OS. All hardware resources are managed by the host OS.
Both Type 1 and Type 2 hypervisors possess the partitioning, isolation, encapsulation, and hardware independence features.
  • Partitioning: indicates the VMM capability of allocating server resources to multiple VMs. Each VM can run an independent OS (the same as or different from the OSs running on other VMs on the same server) so that multiple applications can coexist on one server. Each OS gains access only to its own virtual hardware (including the virtual NIC, virtual CPUs, and virtual memory) provided by the VMM. The partitioning feature ensures the following:
  1. Resource quotas are allocated to each partition, preventing any single VM from overusing the server's resources.
  2. Each VM runs an independent OS.
  • Isolation: Multiple VMs created through partitioning are logically isolated from each other. The isolation feature ensures the following:
  1. Even if one VM crashes due to an OS failure, application breakdown, or driver failure, it does not affect the other VMs on the same server.
  2. Each VM behaves as if it were running on an independent physical machine. If a VM is infected with worms or viruses, they remain isolated from the other VMs.

    The isolation feature allows you to control resources to implement performance isolation. That is, you can specify the minimum and maximum resource usages for each VM to prevent a VM from exclusively occupying all resources in the system. Multiple workloads, applications, or OSs can run on a single machine, without causing problems such as application conflicts and DLL conflicts mentioned in our discussions about the limitations of the traditional x86 architecture.
  • Encapsulation: Each VM is saved as a group of hardware-independent files, including its hardware configuration, BIOS configuration, memory status, disk status, and CPU status. You can copy, save, and move a VM by copying only a few files. Let's use VMware Workstation as an example: you can copy a set of VM files to another computer where VMware Workstation is installed and restart the VM there. Encapsulation is the most important feature for VM migration and virtualization, because encapsulating a VM as a set of hardware-independent files is what makes VM migration and hot migration possible.
  • Hardware independence: After a VM is encapsulated into a group of files, the VM is completely independent of its underlying hardware. You can migrate the VM by copying its device file, configuration file, and disk file to another host. Because the underlying hardware is shielded by the VMM running on it, the migration succeeds as long as the same VMM runs on the target host as on the source host, regardless of the underlying hardware specifications and configuration. This is similar to editing a Word file with Office 2007 on computer A, which runs Windows 7, and then copying the file to computer B, which runs Windows 10. You only need to check whether Office 2007 is installed on computer B; you do not need to check the CPU model or memory size of the underlying hardware.

Compute Virtualization

Compute virtualization includes CPU virtualization, memory virtualization, and I/O virtualization.

CPU Virtualization

CPU hierarchical protection domains 

Before we talk about CPU virtualization, let's have a brief look at the hierarchical protection domains of CPUs, often called protection rings. There are four rings: Ring 0, Ring 1, Ring 2, and Ring 3, forming a hierarchy from the most to the least privileged. Ring 0 has direct access to the hardware; generally, only the OS and drivers have this privilege. Ring 3 has the least privilege, and all ordinary programs run at Ring 3. To protect the computer, some dangerous instructions can only be executed by the OS, preventing malicious software from arbitrarily calling hardware resources. For example, if a program needs to enable a camera, it must ask a Ring 0 driver to do that on its behalf; otherwise, the operation is rejected.

The OS on a common host issues two types of instructions: privileged instructions and common instructions.
  • Privileged instructions are instructions used to manipulate and manage key system resources. These instructions can be executed by programs of the highest privilege level, that is, Ring 0.
  • Common instructions can be executed by programs of the common privilege level, that is, Ring 3.
In a virtualized environment, there is another special type of instruction, called a sensitive instruction, which changes the operating mode of a VM or the state of the host machine. Once the guest OS is deprived of the privilege to run in Ring 0, instructions that would originally require Ring 0 must instead be handled by the VMM.

Virtualization technology was first applied on IBM mainframes. How do mainframes implement CPU sharing? Let's first learn about their CPU virtualization methods. The methods used by mainframes are deprivileging and trap-and-emulation, also called the classical virtualization methods. The basic principle is as follows: the guest OS runs at a non-privileged level (deprivileging), while the VMM runs at the highest privilege level (fully controlling system resources).

A problem then arises: how can a privileged instruction issued by the guest OS of a VM be executed? Because all VMs have been deprived of their privileges, trap-and-emulation takes effect. After deprivileging, most instructions of the guest OS can still run directly on the hardware. Only when a privileged instruction arrives is it trapped into the VMM for emulation. The VMM, on behalf of the VM, sends the privileged instruction to the real hardware CPU. Combining these classical CPU virtualization methods with the timer interrupt mechanism of the original OS solves the CPU virtualization problem. For example, VM 1 sends privileged instruction 1 to the VMM, which triggers an interrupt. The VMM traps privileged instruction 1 sent by VM 1 for emulation, converts it into privileged instruction 1' of the CPU, schedules instruction 1' to a hardware CPU for execution, and returns the result to VM 1, as shown in Figure 2-3. When VM 1 and VM 2 send privileged instructions to the VMM at the same time, the instructions are trapped for emulation and the VMM schedules them in a unified manner: instruction 1' is executed first, and then instruction 2', as shown in Figure 2-4. CPU virtualization is thus implemented using the timer interrupt mechanism together with deprivileging and trap-and-emulation.
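
To make the flow concrete, here is a toy sketch (illustrative only; the instruction names and the VMM handler are invented, not real CPU semantics) of deprivileging with trap-and-emulation: common instructions run directly, while privileged instructions trap into the VMM, which emulates them on behalf of the VM and returns the result.

    # Toy model of deprivileging and trap-and-emulation (hypothetical instructions).
    PRIVILEGED = {"LOAD_CR3", "IO_OUT", "HLT"}      # instructions that must trap to the VMM

    def vmm_emulate(vm_id, instr, operand):
        """VMM handler: emulate a trapped privileged instruction for one VM."""
        # ...here the VMM would issue the real privileged instruction to the physical CPU...
        return f"{instr}({operand}) emulated by VMM for VM{vm_id}"

    def run_guest(vm_id, instructions):
        for instr, operand in instructions:
            if instr in PRIVILEGED:
                print(vmm_emulate(vm_id, instr, operand))   # trap into the VMM
            else:
                print(f"VM{vm_id}: {instr}({operand}) runs directly on hardware")

    run_guest(1, [("ADD", "r1,r2"), ("LOAD_CR3", "0x1000"), ("MUL", "r3,r4")])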

Unified scheduling of all instructions 

Special instructions

Why is the timer interrupt mechanism required? If an emergency occurs outside the system, inside the system, or in the current program, the CPU immediately suspends the current program and automatically switches to the corresponding handler (interrupt service routine). After the handler completes, the CPU returns to the original program. This process is called an interrupt. For example, when you are watching a video and an instant messaging program suddenly displays a message, the interrupt mechanism is triggered: the CPU pauses the video playback process, executes the instant messaging process, and then, after processing the instant messaging operation, continues to execute the video playback process. Of course, the interrupt time is very short, and users are unaware of it.

As x86 hosts became increasingly powerful, how to apply virtualization technologies to the x86 architecture became the major problem in implementing x86 server virtualization. The natural idea was to reuse the CPU virtualization technology of mainframes. Can the CPU virtualization method used on mainframes be transplanted to x86 servers? The answer is no. But why? To answer this question, we need to understand the differences between x86 CPUs and mainframe CPUs.

Mainframes (and the subsequent midrange computers) use the PowerPC architecture, that is, a reduced instruction set computer (RISC) architecture. In the CPU instruction set of the RISC architecture, the sensitive instructions relevant to VMs are a subset of the privileged instructions. After the VM OS is deprivileged, both privileged and sensitive instructions can therefore be trapped, emulated, and executed. Because the privileged instructions include the sensitive instructions, RISC CPUs can properly use the deprivileging and trap-and-emulation methods. However, the CPU instruction sets of the x86 architecture are complex instruction set computer (CISC) instruction sets, which differ from RISC instruction sets.

RISC instruction set

CISC instruction set

As shown in the preceding figures, the privileged instructions and sensitive instructions in the CISC instruction set do not completely overlap. Specifically, 19 sensitive instructions in the x86 CISC instruction set are not privileged instructions; these sensitive instructions run in the user mode (Ring 1) of the CPU. What problems does this cause? When a VM issues one of the 19 sensitive instructions, the instruction cannot be captured by the VMM by means of trap-and-emulation because it is not a privileged instruction. Therefore, x86 servers cannot be virtualized using the deprivileging and trap-and-emulation methods. This problem is called the virtualization vulnerability. Since mainframe CPU virtualization methods cannot be directly transplanted to the x86 platform, what CPU virtualization methods should the x86 platform use? IT architects came up with three alternative techniques: full virtualization, paravirtualization, and hardware-assisted virtualization (the last proposed by hardware vendors).
Full virtualization
The classical virtualization methods are not suitable for x86 CPUs. The root cause is the 19 sensitive instructions that fall outside the privileged instruction set. CPU virtualization works only if these sensitive instructions can be identified, trapped, and emulated by the VMM. But how can these 19 instructions be identified?

A blanket identification method can be used: all OS requests issued by VMs are forwarded to the VMM, which performs binary translation on them. When the VMM detects privileged or sensitive instructions, the requests are trapped into the VMM for emulation and then scheduled to the CPU's privileged level for execution. When the VMM detects ordinary program instructions, they are executed at the CPU's non-privileged level. This technique is called full virtualization because every instruction issued by a VM has to be filtered. Full virtualization was first proposed and implemented by VMware: the VMM translates the binary code of the VM OS (guest OS) without modifying the guest OS, so VMs have high portability and compatibility. However, binary translation introduces performance overhead in the VMM. On the one hand, full virtualization has the following advantages: the VM OS does not need to be modified, and VMs are highly portable and compatible, supporting a wide range of OSs. On the other hand, it has the following disadvantages: translating the guest OS binary code at runtime causes a large performance loss and increases the complexity of developing the VMM. Xen developed the paravirtualization technique, which compensates for these disadvantages of full virtualization.

Full virtualization

Paravirtualization
The virtualization vulnerability comes from the 19 sensitive instructions. If we can modify the VM OS (guest OS) to avoid the virtualization vulnerability, then the problem can be solved. 

If the guest OS can be modified so that it is aware it is being virtualized, it can use hypercalls into the hypervisor layer to replace the sensitive instructions. Non-sensitive instructions, including ordinary program instructions, are executed directly at the CPU's non-privileged level. Paravirtualization has the following advantages: multiple types of guest OSs can run at the same time, and paravirtualization delivers performance close to that of the original non-virtualized system. Its disadvantages are as follows: the guest OS can be modified only for open-source systems, such as Linux; non-open-source systems, such as Windows, do not support paravirtualization. In addition, the modified guest OS has poor portability.

Paravirtualization
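
Here is a toy sketch of the hypercall idea (purely illustrative; the hypercall names and handlers below are invented): instead of issuing a sensitive instruction and relying on a trap, a paravirtualized guest kernel calls straight into the hypervisor.

    # Toy model of paravirtualization via hypercalls (hypothetical call names/handlers).
    HYPERCALL_TABLE = {
        "set_page_table": lambda arg: f"hypervisor: page table switched to {arg}",
        "submit_io":      lambda arg: f"hypervisor: I/O request '{arg}' queued",
    }

    def hypercall(name, arg):
        """Explicit guest-to-hypervisor call that replaces a sensitive instruction."""
        return HYPERCALL_TABLE[name](arg)

    # A modified (paravirtualized) guest kernel calls the hypervisor directly:
    print(hypercall("set_page_table", "0x2000"))
    print(hypercall("submit_io", "disk-write block=42"))
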
Hardware-assisted virtualization
Full virtualization and paravirtualization both assume that the physical hardware cannot, by itself, identify virtualization-sensitive instructions. If physical CPUs supported virtualization and were able to identify sensitive instructions, it would be a revolutionary change for CPU virtualization.

Fortunately, the CPUs of mainstream x86 hosts now support hardware virtualization technologies, for example, Intel Virtualization Technology (VT-x) and AMD-V. Both VT-x and AMD-V target privileged instructions with a new CPU execution mode that allows the VMM to run in a new root mode below Ring 0. Privileged and sensitive calls are set to trap automatically to the hypervisor, removing the need for binary translation or guest modification. Hardware-assisted virtualization thus closes the virtualization vulnerability, simplifies the VMM software, and eliminates the need for paravirtualization or binary translation.
Hardware-assisted virtualization

Memory Virtualization
Memory virtualization is another important type of compute virtualization besides CPU virtualization. So why does CPU virtualization lead to memory virtualization?

With CPU virtualization, VMs running on top of the VMM layer have replaced physical hosts to run applications. Multiple VMs can run on the same host. A host usually has one or more memory modules. How can memory resources be allocated to multiple VMs properly? Memory virtualization was introduced to address this issue. One problem with memory virtualization is how to allocate memory address space. Generally, a physical host allocates memory address space as follows:
  • The memory address starts from the physical address 0.
  • The memory address space is allocated continuously.
However, after virtualization is introduced, the following problems occur: there is only one memory address space whose physical address starts from 0, so it is impossible for the memory address spaces of all VMs on the host to start from physical address 0. On the other hand, allocating contiguous physical addresses to VMs leads to low memory utilization and inflexibility.

Memory virtualization was introduced to solve the problems of memory sharing and dynamic allocation of memory addresses. Memory virtualization centrally manages the physical memory of a physical machine and aggregates it into a virtualized memory pool available to VMs. Memory virtualization creates a new layer of address spaces, the address spaces of the VMs. The VMs are made to believe that they run in a real physical address space, when in fact their access requests are relayed by the VMM, which stores the mapping between guest address spaces and physical machine address spaces.
Memory virtualization

Memory virtualization involves the translation of three types of memory addresses: the VM's virtual memory address (VA), the VM's physical memory address (PA), and the machine memory address (MA). The following address translation path must be supported so that multiple VMs can run on a physical host: VA (virtual memory) → PA (physical memory) → MA (machine memory). The VM OS controls the mapping from the guest's virtual addresses to the guest's physical addresses (VA → PA). However, the VM OS cannot directly access the machine memory, so the hypervisor must map the guest's physical memory to the machine memory (PA → MA).

We can use an example to explain the difference between MA and PA. If a server has a total of sixteen 16 GB memory modules, its PA space is 256 GB, while its MA corresponds to the sixteen physical modules distributed across the different memory slots.
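
A minimal sketch of the two-stage translation (a toy model: the page tables are plain dictionaries, and all addresses and the page size are hypothetical) looks like this:

    # Toy two-stage address translation: VA -> PA (guest page table) -> MA (hypervisor map).
    PAGE_SIZE = 4096

    guest_page_table = {0x0: 0x5, 0x1: 0x9}    # guest virtual page  -> guest physical page
    hypervisor_map   = {0x5: 0x23, 0x9: 0x47}  # guest physical page -> machine page

    def translate(va):
        vpn, offset = divmod(va, PAGE_SIZE)
        pa = guest_page_table[vpn] * PAGE_SIZE + offset            # VA -> PA (done by the guest OS)
        ma = hypervisor_map[pa // PAGE_SIZE] * PAGE_SIZE + offset  # PA -> MA (done by the hypervisor)
        return ma

    print(hex(translate(0x10AB)))   # virtual page 1, offset 0xAB -> machine address 0x470AB

In real hypervisors, this double lookup is collapsed into a single step by shadow page tables or by hardware extensions such as Intel EPT and AMD NPT.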

I/O Virtualization

With compute virtualization, a large number of VMs can be created on a single host, and these VMs all need to access the I/O devices of this host. However, I/O devices are limited, so sharing them among multiple VMs requires the VMM. The VMM intercepts access requests from VMs to I/O devices, simulates the I/O devices in software, and responds to the I/O requests. This way, multiple VMs can access I/O resources concurrently. I/O virtualization can be implemented with the following methods: full virtualization, paravirtualization, and hardware-assisted virtualization. Hardware-assisted virtualization is the mainstream technology for I/O virtualization.
Full virtualization
The VMM virtualizes I/O devices for VMs. When a VM initiates an I/O request to an I/O device, the VMM intercepts the request and then sends the real access request to the physical device for processing. No matter which type of OS the VM runs, the OS does not need to be modified for I/O virtualization, and multiple VMs can directly use the I/O devices of the physical server. However, the VMM needs to intercept the I/O requests delivered by each VM in real time and emulate them onto the real I/O devices. Such real-time monitoring and emulation are implemented by software running on the CPU, which causes severe performance loss on the server.
Paravirtualization
Unlike full virtualization, paravirtualization needs a privileged VM.
Paravirtualization requires each VM to run a frontend driver. When VMs need to access an I/O device, the VMs send I/O requests to the privileged VM through the frontend driver, and the backend driver of the privileged VM collects the I/O requests sent by each VM. The backend driver then processes the I/O requests by time and by channel. The privileged VM runs the physical I/O device driver and sends the I/O requests to the physical I/O device. After processing a request, the I/O device returns the result to the privileged VM. Because VMs send I/O requests to a privileged VM and the privileged VM accesses the real I/O device, the performance loss in the VMM is reduced. However, the VM OS needs to be modified; specifically, the I/O request handling of the OS must be changed so that all I/O requests are sent to the privileged VM for processing. This requires that the VM OS be modifiable (usually Linux).

Xen architecture 

Domain 0 is the privileged VM, and Domain U is a user VM. The device information of all user VMs is stored in the XenStore of the privileged VM, Domain 0. The XenBus (a paravirtualization driver developed for Xen) in a user VM communicates with the XenStore of Domain 0 to obtain the device information and load the frontend driver corresponding to the device. When the user VM issues an I/O request, the frontend driver forwards all data to the backend driver through this interface. The backend driver processes the data of the I/O requests by time and by channel. Finally, the physical I/O device driver of Domain 0 sends the I/O request to the physical I/O device.

Let's take an example to compare full virtualization and paravirtualization. In full virtualization, VMM acts as an investigator. It collects and summarizes opinions and requests of each customer (VM). In paravirtualization, an opinion receiving box (that is, a privileged VM) is prepared and each customer puts opinions and requests into the box, and VMM centrally processes the opinions and requests. Paravirtualization significantly reduces the performance loss of VMM and therefore delivers better I/O performance. Full virtualization and paravirtualization have similarities: VMM is responsible for I/O access processing, which causes performance loss when VMs access I/O devices.
Hardware-assisted virtualization
Different from the preceding two methods, hardware-assisted virtualization directly installs the I/O device driver in the VM OS without any change to the OS. This method is equivalent to traditional PC OS access to hardware. Therefore, the time required for a VM to access the I/O hardware is the same as that for a traditional PC to access the I/O hardware. In the preceding example, hardware-assisted virtualization is like an intelligent information collection and processing platform. Users' requests can be directly submitted to the platform and the platform automatically processes the requests. Therefore, hardware-assisted virtualization outperforms full virtualization and paravirtualization in terms of I/O performance. However, hardware-assisted virtualization requires special hardware support.

Mainstream Compute Virtualization

CPU virtualization, memory virtualization, and I/O virtualization can be implemented to enable the reuse of physical resources. Multiple virtual servers can run on a physical host at the same time, and each virtual server can run different workloads. This improves hardware utilization. In addition, everything about a virtual server can be packed into a single file or folder. This breaks the tight coupling between software and hardware and allows VMs to migrate across hosts and even data centers, improving the reliability of workloads running on VMs. In cloud computing, we mainly use virtualization to implement IaaS cloud services.

There are three cloud service models: IaaS, PaaS, and SaaS. Some PaaS and SaaS services are implemented based on virtualization, and some are implemented based on physical hardware and distributed computing.

Let's use "The Wandering Earth", the first Chinese hard science-fiction movie, as an example; it impressed many people with its vivid visuals. The movie's scenes were rendered using the rendering solution of the Huawei public cloud (HUAWEI CLOUD), which involves multiple products. SD rendering can be implemented by the C3 ECS and other cloud services; the C3 ECS uses virtualization technology at the bottom layer. Panoramic rendering can be implemented by the BMS and other cloud services; the BMS runs on real physical servers rather than on virtualization technology.

Cloud computing is a business model that provides users with IT services anytime anywhere. Virtualization is an important technical means for cloud computing implementation.

There are many mainstream virtualization technologies, generally classified as open-source or closed-source. Open-source technologies include KVM and Xen. Closed-source virtualization technologies include Microsoft Hyper-V, VMware vSphere, and Huawei FusionSphere.
Open-source technologies are free of charge and can be adopted at any time, and users can customize them to meet special requirements based on the open-source code. However, open-source technologies demand strong technical skills from their users: once a problem occurs in the system, recovery relies heavily on the administrator's skillset and experience. With closed-source technologies, users cannot view or customize the source code. Closed-source virtualization products are generally not free of charge but can be used out of the box, and if a system problem occurs, the vendor provides all-round support.

For users, debating whether open-source or closed-source virtualization technologies are better is of little value; what makes sense is determining their respective application scenarios.

Among open-source virtualization technologies, Xen is on a par with KVM. KVM provides full virtualization, while Xen supports both paravirtualization and full virtualization. KVM, a module in the Linux kernel, virtualizes CPUs and memory, and each KVM VM is a process of the Linux OS; other I/O devices (such as NICs and disks) are virtualized by QEMU. Different from KVM, Xen runs directly on hardware, and VMs run on Xen. VMs in Xen are classified into the privileged VM (Domain 0), which has permission to directly access hardware and manages the other VMs, and user VMs (Domain U). Domain 0 must be started before the other VMs. Domain U is a common VM and cannot directly access hardware resources; all its operations must be forwarded to Domain 0 through the frontend and backend drivers. Domain 0 completes the operations and returns the results to Domain U.

KVM
Huawei virtualization products earlier than version 6.3 are developed based on Xen; version 6.3 and later are developed based on the Kernel-based Virtual Machine (KVM).

KVM is a Type 2, full-virtualization solution. It is a Linux kernel module. A physical machine with this kernel module loaded can function as a hypervisor without affecting the other applications running on the Linux OS. Each VM is one or more processes, and you can even run the kill command to terminate those processes.

After the KVM module is installed in a common Linux OS, three running modes are added:
  • Guest Mode: VMs, including their CPUs, memory, and disks, run in a restricted CPU mode.
  • User Mode: The quick emulator (QEMU) typically runs in this mode. QEMU emulates I/O requests.
  • Kernel Mode: In this mode, the hardware can be operated. When the guest OS executes an I/O operation or privileged instruction, a request needs to be submitted to the user mode, and then the user mode initiates a hardware operation request to the kernel mode again to operate the hardware.
A KVM system consists of three parts: the KVM kernel module, QEMU, and the management tool. The KVM kernel module and QEMU are the core components of KVM.
KVM architecture 

Other virtualization products use similar architectures. The KVM kernel module is the core of a KVM VM. This module initializes the CPU hardware, enables the virtualization mode, and runs the guest machine in VM mode, supporting the running of the virtual guest.

KVM is composed of two kernel modules: a common module (kvm.ko) and a processor-specific module (kvm-intel.ko or kvm-amd.ko). kvm.ko implements the core virtualization functions. By itself, KVM does not perform any emulation. Instead, it exposes the /dev/kvm interface, which a userspace program can use to create vCPUs, allocate address space for virtual memory, read and write vCPU registers, and run vCPUs. kvm.ko provides only CPU and memory virtualization. However, a VM requires other I/O devices, such as NICs and hard disks, besides CPUs and memory, and QEMU is required to implement these other virtualization functions. Therefore, the KVM kernel module and QEMU together form a complete virtualization technology.
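
As a rough illustration of that interface (a sketch only; it assumes a Linux host with KVM enabled and /dev/kvm accessible, and the ioctl request numbers correspond to _IO(0xAE, nr) in linux/kvm.h), a userspace program can query the API version and create a VM and a vCPU like this:

    # Minimal sketch of the /dev/kvm userspace interface (assumes Linux with KVM enabled).
    import fcntl, os

    KVM_GET_API_VERSION = 0xAE00   # _IO(0xAE, 0x00)
    KVM_CREATE_VM       = 0xAE01   # _IO(0xAE, 0x01)
    KVM_CREATE_VCPU     = 0xAE41   # _IO(0xAE, 0x41)

    kvm = os.open("/dev/kvm", os.O_RDWR)
    print("KVM API version:", fcntl.ioctl(kvm, KVM_GET_API_VERSION))  # 12 on current kernels

    vm_fd   = fcntl.ioctl(kvm, KVM_CREATE_VM, 0)      # returns a new file descriptor for the VM
    vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)  # vCPU with index 0

    # A real userspace program (such as QEMU) would now register guest memory with
    # KVM_SET_USER_MEMORY_REGION, mmap the vCPU run structure, and loop on KVM_RUN.
    os.close(vcpu_fd); os.close(vm_fd); os.close(kvm)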

QEMU was originally not a part of KVM. It is a universal open-source emulator that implements virtualization purely in software: the guest OS believes it is interacting with hardware, when it is actually interacting with QEMU, and QEMU in turn interacts with the hardware. Because every interaction with the hardware passes through QEMU, the emulation performance delivered by QEMU alone is low. QEMU is able to emulate CPUs and memory as well, but in KVM, QEMU is used only to emulate I/O devices. The KVM developers reworked QEMU to create QEMU-KVM.

In QEMU-KVM, KVM runs in kernel space and QEMU runs in user space. When the guest OS issues instructions, the CPU- and memory-related instructions are passed to the kernel module through ioctl calls on the /dev/kvm interface in QEMU-KVM; from QEMU's perspective, this accelerates virtualization. Other I/O operations are handled by the QEMU part of QEMU-KVM. Together, KVM and QEMU form the complete virtualization technology.

In addition to virtualization of various devices, QEMU-KVM provides native tools for creating, modifying, and deleting VMs. However, Libvirt is the most widely used tool and API for managing KVM VMs.

Libvirt is an open-source project and a powerful management tool. It can manage virtualization platforms such as KVM, Xen, VMware, and Hyper-V. Libvirt provides an API implemented in the C language, and bindings in other languages, such as Java, Python, and Perl, call the Libvirt API to manage virtualization platforms. Libvirt is used by many applications: in addition to the virsh command set, Virt-manager, Virt-viewer, and Virt-install manage KVM VMs through Libvirt.
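
For example, a short sketch with the Python binding (assuming libvirt-python is installed, a local KVM/QEMU hypervisor is running, and using a hypothetical VM named vm01):

    # Sketch: managing KVM VMs through the libvirt Python binding.
    import libvirt

    conn = libvirt.open("qemu:///system")            # connect to the local hypervisor
    print("Hypervisor type:", conn.getType())        # e.g. "QEMU"

    for dom in conn.listAllDomains():                # enumerate all defined VMs
        state, maxmem, mem, vcpus, cputime = dom.info()
        print(dom.name(), "state:", state, "vCPUs:", vcpus, "memory (KiB):", mem)

    dom = conn.lookupByName("vm01")                  # hypothetical VM name
    if not dom.isActive():
        dom.create()                                 # power the VM on
    conn.close()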

In cloud computing, there are various hypervisors, each with its own management tool and complex, hard-to-use parameters. Because the hypervisors are not unified and there is no unified programming interface to manage them, managing a cloud computing environment becomes difficult. Libvirt addresses this: it can connect to various hypervisors, such as KVM and Xen, and provides APIs in multiple languages. Libvirt serves as the middle layer between management tools and hypervisors and is completely transparent to upper-layer users.

QEMU is an emulation software tool for implementing I/O virtualization. It has poor emulation performance. For example, if QEMU is used to simulate a NIC of a Windows VM, the NIC rate displayed on the system is only 100 Mbit/s. It cannot meet the high NIC rate requirements of some applications. A new technology, Virtio, was introduced. In Windows virtualization, using Virtio can increase the NIC rate of a Windows VM to 10 Gbit/s.

Let's see how VM disk operations are performed without Virtio.

Default I/O operation process 

  1. A disk device of a VM initiates an I/O operation request.
  2. I/O Trap Code (I/O capture program) in the KVM module captures the I/O operation request, performs corresponding processing, and then puts the processed request into the I/O shared page.
  3. The KVM module notifies QEMU that a new I/O operation request is placed in the shared page.
  4. After receiving the notification, QEMU obtains the detailed information about the I/O operation request from the shared page. 
  5. QEMU simulates the request and calls the device driver running in kernel mode based on the request information to perform the real I/O operation.
  6. The I/O operation is then performed on physical hardware through the device driver.
  7. QEMU returns the operation result to the shared page and notifies the KVM module that the I/O operation is complete.
  8. I/O Trap Code reads the returned result from the shared page.
  9. I/O Trap Code returns the operation result to the VM.
  10. The VM returns the result to the application that initiated the operation.

In steps 2, 3, and 7, KVM does not modify the I/O operation in any way; it only captures the request and sends the notification. The Virtio technology was developed to simplify this procedure.

If Virtio is used, the procedure is as follows:

I/O operation process with Virtio used  

  1. The VM initiates an I/O operation request.
  2. The I/O operation request is not captured by the I/O capture program. Instead, the request is stored in the ring buffer between the frontend and backend drivers. At the same time, the KVM module notifies the backend driver.
  3. QEMU obtains the detailed information about the operation request from the ring buffer.
  4. The backend driver directly calls the actual physical device driver to perform the I/O operation.
  5. The operation is completed by the device driver.
  6. QEMU returns the operation result to the ring buffer, and the KVM module notifies the frontend driver.
  7. The frontend driver obtains the operation result from the ring buffer.
  8. The frontend driver returns the result to the application that initiated the operation.
The advantages of Virtio are as follows:
  • Saves the hardware resources required for QEMU emulation.
  • Reduces the number of I/O request paths and improves the performance of virtualization devices.
Virtio has some disadvantages. For example, some old or uncommon devices cannot use Virtio but can only use QEMU.
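
To visualize the frontend/backend ring described above, here is a purely conceptual toy (the queue names and requests are invented; the real virtio vring layout is more elaborate):

    # Conceptual toy of a virtio-style shared ring between the guest frontend driver
    # and the host backend driver (illustrative only; not the real vring layout).
    from collections import deque

    avail = deque()    # requests placed by the frontend (guest) driver
    used  = deque()    # completions returned by the backend (host) driver

    def frontend_submit(request):
        avail.append(request)            # guest driver places the request in the ring
        backend_process()                # "kick": KVM notifies the backend driver

    def backend_process():
        while avail:
            req = avail.popleft()
            # a real backend would call the physical device driver here
            used.append(f"done: {req}")

    frontend_submit("disk-write block=42")
    frontend_submit("net-tx frame=0x1f")
    print(list(used))                    # the frontend reads completions from the ring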

FusionCompute

Huawei FusionSphere virtualization suite is a leading virtualization solution. This solution significantly improves data center infrastructure efficiency and provides the following benefits for customers:
  • Improves infrastructure resource utilization in data centers.
  • Significantly accelerates service rollout.
  • Substantially reduces power consumption in data centers.
  • Provides rapid automatic fault recovery for services, decreases data center costs, and increases system runtime by leveraging high availability and powerful restoration capabilities of virtualized infrastructure.
The FusionSphere virtualization suite virtualizes hardware resources using the virtualization software deployed on physical servers, so that one physical server can function as multiple virtual servers. It consolidates existing VMs on light-load servers to maximize resource utilization and release more servers to carry new applications and solutions.

FusionCompute is the cloud OS software in the FusionSphere virtualization suite and a mandatory component. It virtualizes hardware resources and centrally manages virtual resources, service resources, and user resources. It virtualizes compute, storage, and network resources using the virtual computing, virtual storage, and virtual network technologies. It centrally schedules and manages virtual resources over unified interfaces. FusionCompute provides high system security and reliability and reduces the OPEX, helping carriers and enterprises build secure, green, and energy-saving data centers.

FusionCompute consists of two parts: the Computing Node Agent (CNA) and the Virtual Resource Manager (VRM). In addition to CNA and VRM, FusionCompute includes the Unified Virtualization Platform (UVP), a unified virtualization platform developed by Huawei. UVP is a hypervisor, like KVM and Xen. The FusionCompute hypervisor adopts the bare-metal architecture and runs directly on servers to virtualize hardware resources. With the bare-metal architecture, FusionCompute delivers VMs with almost server-level performance, reliability, and scalability.

The architecture of FusionCompute is similar to that of KVM, with VRM playing the role of the management tool in KVM. Administrators and common users can manage and use FusionCompute on the GUI-based portal provided by VRM. VRM is based on the Linux OS, so many Linux commands can be used after you log in to VRM.

VRM provides the following functions:
  • Manages block storage resources in a cluster.
  • Manages network resources, such as IP addresses and virtual local area network (VLAN) IDs, in a cluster and allocates IP addresses to VMs.
  • Manages the lifecycle of VMs in a cluster and distributes and migrates VMs across compute nodes.
  • Dynamically scales resources in a cluster.
  • Implements centralized management of virtual resources and user data and provides elastic computing, storage, and IP address services.
  • Allows O&M personnel to remotely access FusionCompute through a unified web UI to perform O&M on the entire system, such as resource monitoring, resource management, and resource report query.

CNA is similar to the QEMU+KVM module in KVM. CNA provides the virtualization function. It is deployed in a cluster to virtualize compute, storage, and network resources in the cluster into a resource pool for users to use. CNA is also based on the Linux OS.

CNA provides the following functions:
  • Provides the virtual computing function.
  • Manages the VMs running on compute nodes.
  • Manages compute, storage, and network resources on compute nodes.
CNA manages VMs and resources on the local node. VRM manages clusters or resources in the resource pool. When users modify a VM or perform other VM lifecycle operations on VRM, VRM sends a command to the CNA node. Then, the CNA node executes the command. After the operation is complete, CNA returns the result to VRM, and VRM records the result in its database. Therefore, do not modify VMs or other resources on CNA. Otherwise, the records in the VRM database may be inconsistent with the actual operations.

In addition to Huawei's hardware products, FusionCompute also supports other servers based on the x86 hardware platform and is compatible with multiple types of storage devices, allowing enterprises to flexibly choose appropriate devices. A cluster supports a maximum of 64 hosts and 3000 VMs. FusionCompute provides comprehensive rights management functions, allowing authorized users to manage system resources based on their specific roles and assigned permissions.

This course uses FusionCompute to experiment with virtualization functions and features. For details, see the corresponding experiment manual.

After the installation is completed based on the experiment manual, we will verify the following items:
1. Is FusionCompute 6.3 developed based on KVM?
2. If it is, does it use QEMU and Libvirt?


Ref : https://e.huawei.com/en/talent/#/resources