📚 Part_301: VMCS and Key System Structures Explained
Introduction
In this blog series about building a minimal hypervisor in Rust using a Linux kernel module, we have already seen how to activate VMX, how the system manages memory, and how to allocate memory through a kernel module. The next step is to create and configure a Virtual Machine Control Structure (VMCS), which is both critical and tedious. Before doing so, we give a quick theoretical recap of the VMCS and discuss the key registers and tables it references.
What is the VMCS
The VMCS (Virtual Machine Control Structure) is a data structure the CPU uses to store a virtual machine's state. It governs the transitions between VMX root and non-root operation and controls how the processor behaves while in VMX non-root operation.
For a virtual machine with multiple logical (virtual) CPUs, each CPU uses a separate VMCS. A logical CPU associates a region in memory with each VMCS, which software can reference using a 64-bit physical address. This VMCS pointer must be 4 KB aligned.
A logical CPU can maintain several "active" VMCS structures concurrently, but only one is designated as the "current" VMCS. All VMX operations are performed on the current VMCS.
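To make the alignment requirement concrete, here is a minimal sketch of how a VMCS region could be modeled in Rust. The type and function names (`VmcsRegion`, `is_valid_vmcs_pointer`) are hypothetical and not taken from the series' code. The sketch assumes a full 4 KB region (the exact size is reported by the IA32_VMX_BASIC MSR) and relies on the documented layout of the first eight bytes: the revision identifier followed by the VMX-abort indicator.

```rust
use core::mem::{align_of, size_of};

/// One VMCS region, sized and aligned to 4 KB (hypothetical type name).
#[repr(C, align(4096))]
pub struct VmcsRegion {
    /// Bits 30:0 hold the VMCS revision identifier from IA32_VMX_BASIC.
    pub revision_id: u32,
    /// VMX-abort indicator, written by the CPU when a VM exit fails.
    pub abort_indicator: u32,
    /// Implementation-specific data; software should not interpret it.
    pub data: [u8; 4096 - 8],
}

// Compile-time checks: the layout matches what the CPU expects.
const _: () = assert!(size_of::<VmcsRegion>() == 4096);
const _: () = assert!(align_of::<VmcsRegion>() == 4096);

impl VmcsRegion {
    /// Creates a zeroed region carrying the given revision identifier.
    pub fn new(revision_id: u32) -> Self {
        Self {
            // Bit 31 is not part of the revision identifier, so mask it off.
            revision_id: revision_id & 0x7FFF_FFFF,
            abort_indicator: 0,
            data: [0; 4096 - 8],
        }
    }

    pub fn revision_id(&self) -> u32 {
        self.revision_id
    }
}

/// Returns true if a physical address satisfies the 4 KB alignment rule.
pub fn is_valid_vmcs_pointer(phys_addr: u64) -> bool {
    phys_addr & 0xFFF == 0
}
```

In a kernel module, the address handed to the CPU would be the physical address of such a region, not the virtual address Rust sees, so a translation step is still needed.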
The VMCS is organized into three groups:
- VMX Controls:
  - VM-execution control fields: Control the CPU's behavior in VMX non-root operation.
  - VM-entry control fields: Control the VM entry process into non-root operation.
  - VM-exit control fields: Control the return from non-root to root operation.
  - VM-exit information fields: Store information about the causes of VM exits.
- Host-State: Contains the host's state when operating in root mode, which will be loaded on a VM exit.
- Guest-State: Contains the guest's state, which will be loaded on a VM entry.
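The grouping above is also visible in the numeric encodings that identify VMCS fields: bits 11:10 of an encoding give the field type and bits 14:13 give its width. The following sketch decodes both. The enum and function names are hypothetical; the four constants are encodings from the Intel SDM's field-encoding appendix.

```rust
/// Which area of the VMCS a field belongs to (bits 11:10 of the encoding).
#[derive(Debug, PartialEq, Eq)]
pub enum FieldGroup {
    Control,
    ExitInformation,
    GuestState,
    HostState,
}

/// Width of a VMCS field (bits 14:13 of the encoding).
#[derive(Debug, PartialEq, Eq)]
pub enum FieldWidth {
    Bits16,
    Bits64,
    Bits32,
    Natural,
}

pub fn field_group(encoding: u32) -> FieldGroup {
    match (encoding >> 10) & 0b11 {
        0 => FieldGroup::Control,
        1 => FieldGroup::ExitInformation,
        2 => FieldGroup::GuestState,
        _ => FieldGroup::HostState,
    }
}

pub fn field_width(encoding: u32) -> FieldWidth {
    match (encoding >> 13) & 0b11 {
        0 => FieldWidth::Bits16,
        1 => FieldWidth::Bits64,
        2 => FieldWidth::Bits32,
        _ => FieldWidth::Natural,
    }
}

// A few well-known field encodings.
pub const PIN_BASED_VM_EXEC_CONTROL: u32 = 0x4000;
pub const VM_EXIT_REASON: u32 = 0x4402;
pub const GUEST_RIP: u32 = 0x681E;
pub const HOST_RIP: u32 = 0x6C16;
```

For example, `field_group(GUEST_RIP)` yields `GuestState` and `field_group(HOST_RIP)` yields `HostState`, which is a handy sanity check when building a table of fields by hand.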
With this understanding, we can summarize the role of the VMCS in a simple hypervisor: the VMLAUNCH instruction loads the guest state, and on a VM exit (which happens immediately in our case), the host state is loaded. More advanced features, such as pausing, switching, and resuming virtual machines, are also available.
The Global Descriptor Table
The Global Descriptor Table (GDT) is a data structure that defines the characteristics of memory areas, called segments, which are used during program execution. The GDTR CPU register holds the table's base address and limit. It is loaded using the LGDT instruction, which takes as its argument a pointer to a small GDT descriptor containing those two values.
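As an illustration, the descriptor that LGDT consumes in 64-bit mode is a 10-byte structure: a 16-bit limit followed by a 64-bit base. Below is a minimal sketch; the type name `DescriptorTablePointer` and the helper `for_table` are hypothetical, and the table is modeled as a slice of raw 8-byte entries. The same layout is used for the IDT with LIDT.

```rust
use core::mem::size_of;

/// Operand layout for LGDT/LIDT in 64-bit mode (hypothetical type name).
#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct DescriptorTablePointer {
    /// Size of the table in bytes, minus one.
    pub limit: u16,
    /// Linear address of the first table entry.
    pub base: u64,
}

impl DescriptorTablePointer {
    /// Builds the pointer for a table made of 8-byte entries.
    pub fn for_table(table: &[u64]) -> Self {
        assert!(!table.is_empty(), "a descriptor table needs at least one entry");
        Self {
            // The limit is the offset of the last valid byte, hence the -1.
            limit: (table.len() * size_of::<u64>() - 1) as u16,
            base: table.as_ptr() as u64,
        }
    }
}
```

A three-entry table therefore has a limit of 23, not 24. Because the struct is packed, its fields should be copied into locals before being borrowed.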
Each code or data entry in the GDT is 8 bytes in size (in 64-bit mode, system descriptors such as the TSS descriptor occupy 16 bytes), and the first entry is always a null descriptor. An entry is accessed using a Segment Selector, a 16-bit value that points to the corresponding segment descriptor. This segment descriptor contains essential information about the segment, such as its size, location, access permissions, and status.
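A segment selector packs three pieces of information into its 16 bits: the table index (bits 15:3), a table indicator that chooses between GDT and LDT (bit 2), and the requested privilege level (bits 1:0). The sketch below decodes them; the `SegmentSelector` wrapper and its method names are hypothetical.

```rust
/// A 16-bit segment selector (hypothetical wrapper type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct SegmentSelector(pub u16);

impl SegmentSelector {
    /// Builds a selector from its three components.
    pub fn new(index: u16, use_ldt: bool, rpl: u8) -> Self {
        Self((index << 3) | ((use_ldt as u16) << 2) | (rpl as u16 & 0b11))
    }

    /// Index of the descriptor inside the table.
    pub fn index(self) -> u16 {
        self.0 >> 3
    }

    /// True if the selector refers to the LDT instead of the GDT.
    pub fn uses_ldt(self) -> bool {
        self.0 & 0b100 != 0
    }

    /// Requested privilege level (0 to 3).
    pub fn rpl(self) -> u8 {
        (self.0 & 0b11) as u8
    }

    /// Byte offset of the descriptor from the table base (8-byte entries).
    pub fn table_offset(self) -> usize {
        self.index() as usize * 8
    }
}
```

For instance, the selector 0x10 refers to GDT entry 2 at privilege level 0, and 0x2B refers to GDT entry 5 at privilege level 3.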
The GDT contains various segment descriptors, including those for code segments, data segments, stack segments, and special descriptors like the Task State Segment (TSS) and LDT (Local Descriptor Table). The TSS was designed for hardware task switching; in 64-bit mode it mainly holds the stack pointers the CPU uses when changing privilege level or handling interrupts. The LDT is an additional descriptor table that provides extra segmentation capabilities for specific tasks.
To optimize address translation time, the CPU provides segment registers, which hold the values of the segment selectors. These segment registers reduce the overhead of address translation when accessing memory. For any program to run, including virtual machines, the CPU must load at least the Code Segment (CS), Data Segment (DS), and Stack Segment (SS) registers.
In virtualization, the GDT is critical because it helps manage the segmentation context of both the host and each guest when the CPU switches between virtual machines. When the hypervisor is running, it uses the GDT to load the correct segment descriptors for the host. When a guest VM runs, its own set of descriptors is loaded, ensuring proper memory access control and isolation between the host and the guest. This isolation ensures that each VM cannot directly access the memory of another guest or the host, maintaining the security and stability of the system.
The GDT's role extends beyond segmentation to memory protection, which matters for virtualization. It defines the base addresses and limits for segments, which bound what each segment can reach (in 64-bit mode most bases and limits are ignored, so paging carries the bulk of this work). The VMCS works alongside the GDT during VM entry and exit. On a VM exit, the CPU saves the guest's segment registers (selector, base, limit, and access rights) into the guest-state area and loads the host's segment registers from the host-state area of the VMCS. On a VM entry, the CPU loads the guest's segment registers from the guest-state area. The host's state is not saved automatically on VM entry: the hypervisor must write it into the host-state area beforehand.
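Because the guest-state area stores each segment as separate base, limit, and access-rights values, a hypervisor typically has to unpack GDT entries into that form. The sketch below does this for an 8-byte code or data descriptor. The names `SegmentFields` and `unpack_descriptor` are hypothetical, and 16-byte system descriptors (such as a 64-bit TSS) are deliberately not handled.

```rust
/// Segment values in the shape the VMCS guest-state area expects.
#[derive(Debug, PartialEq, Eq)]
pub struct SegmentFields {
    pub base: u32,
    /// Limit in bytes, already scaled if the granularity bit is set.
    pub limit: u32,
    /// Access byte in bits 7:0, flags (AVL, L, D/B, G) in bits 15:12.
    pub access_rights: u32,
}

/// Unpacks a raw 8-byte code/data segment descriptor.
pub fn unpack_descriptor(desc: u64) -> SegmentFields {
    // Base: bits 39:16 (low 24 bits) and bits 63:56 (high 8 bits).
    let base = (((desc >> 16) & 0xFF_FFFF) | (((desc >> 56) & 0xFF) << 24)) as u32;

    // Limit: bits 15:0 (low 16 bits) and bits 51:48 (high 4 bits).
    let raw_limit = ((desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16)) as u32;

    // With G = 1 (bit 55) the limit counts 4 KB pages instead of bytes.
    let granularity = (desc >> 55) & 1 == 1;
    let limit = if granularity {
        (raw_limit << 12) | 0xFFF
    } else {
        raw_limit
    };

    // Shift the access byte down to bit 0 and drop the limit nibble (bits 11:8).
    let access_rights = ((desc >> 40) & 0xF0FF) as u32;

    SegmentFields { base, limit, access_rights }
}
```

Applied to `0x00AF9A000000FFFF`, a typical 64-bit kernel code descriptor, this yields a base of 0, a limit of `0xFFFFFFFF`, and access rights of `0xA09A`.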
The GDT also defines access control and protection mechanisms through privilege levels (Ring 0 through Ring 3). Each descriptor carries a privilege level, so less-privileged code (such as user mode in Ring 3) cannot use segments reserved for more-privileged code (such as the kernel in Ring 0), preventing illegal access.
The Task State Segment (TSS) also plays a part: the VMCS holds a task register (TR) field for both the host and the guest, so each side keeps its own TSS. Although hardware task switching is not used in modern OS multitasking, the TSS still supplies the stack pointers the CPU switches to when an interrupt or exception changes the privilege level.
In summary, the GDT is fundamental for managing memory segmentation in traditional environments and integral in virtualization. Managing the context for both the host and the guests enables the CPU to switch between VMs while maintaining memory isolation, access control, and efficient address translation.
The Interrupt Descriptor Table
The Interrupt Descriptor Table (IDT) is a data structure similar to the Global Descriptor Table (GDT) but handles interrupt and exception management. The IDT is located through the IDTR (Interrupt Descriptor Table Register), which holds its base address and limit and is loaded using the LIDT instruction with a pointer to an IDT descriptor as an argument. Each entry in the IDT, called a gate, handles one interrupt vector. Events fall into two broad categories: exceptions and interrupts. Exceptions occur synchronously during code execution, usually due to errors (e.g., divide by zero, invalid opcodes). For a fault, the CPU saves the address of the faulting instruction so that it can be retried; for a trap, it saves the address of the next instruction. Exceptions can be handled using a Trap Gate, which leaves the interrupt flag unchanged while the handler runs.
On the other hand, interrupts are usually asynchronous and not directly tied to code execution (e.g., hardware interrupts), so the CPU saves the address of the next instruction to resume execution afterward. These are managed using Interrupt Gates, which specify an Interrupt Service Routine (ISR) and clear the interrupt flag on entry. Additionally, in 32-bit protected mode, Task Gates in the IDT can refer to entries in the Global Descriptor Table (GDT) that point to a Task State Segment (TSS), which allows the CPU to switch tasks in response to hardware interrupts. Task gates are not supported in 64-bit mode.
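To show what a gate looks like in practice, here is a minimal sketch of a 64-bit IDT entry. In long mode each gate is 16 bytes, with the handler address split into three pieces. The type name `IdtGate` and its constructor are hypothetical; the gate type values (0xE for an interrupt gate, 0xF for a trap gate) are the architectural ones.

```rust
/// Gate type for an interrupt gate (clears the interrupt flag on entry).
pub const GATE_INTERRUPT: u8 = 0xE;
/// Gate type for a trap gate (leaves the interrupt flag unchanged).
pub const GATE_TRAP: u8 = 0xF;

/// One 16-byte IDT entry in 64-bit mode (hypothetical type name).
/// With `repr(C)` the fields are laid out without any padding.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct IdtGate {
    pub offset_low: u16,
    /// Code segment selector the handler runs in.
    pub selector: u16,
    /// Interrupt Stack Table index in bits 2:0 (0 means "not used").
    pub ist: u8,
    /// Present bit (7), DPL (6:5), gate type (3:0).
    pub type_attr: u8,
    pub offset_mid: u16,
    pub offset_high: u32,
    pub reserved: u32,
}

impl IdtGate {
    /// Builds a present gate pointing at `handler`.
    pub fn new(handler: u64, selector: u16, gate_type: u8, dpl: u8) -> Self {
        Self {
            offset_low: handler as u16,
            selector,
            ist: 0,
            type_attr: 0x80 | ((dpl & 0b11) << 5) | (gate_type & 0xF),
            offset_mid: (handler >> 16) as u16,
            offset_high: (handler >> 32) as u32,
            reserved: 0,
        }
    }

    /// Reassembles the handler address from its three parts.
    pub fn handler(&self) -> u64 {
        (self.offset_low as u64)
            | ((self.offset_mid as u64) << 16)
            | ((self.offset_high as u64) << 32)
    }
}
```

A kernel-only interrupt gate built this way has a type/attribute byte of `0x8E`, a value that shows up in many IDT dumps.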
In virtualization, the IDT becomes even more crucial. While the guest operating system runs in guest mode, it uses its own IDT to manage interrupts. However, when the hypervisor needs to regain control, such as during a VM exit, the system must transition back to host mode. The VMCS (Virtual Machine Control Structure) is a key component in this transition. Its host-state area has a HostIdtrBase field, which stores the base address of the host's IDT. There is no matching host limit field: on every VM exit, the processor loads IDTR with this base and sets the limit to 0xFFFF, whereas the guest-state area stores both the base and the limit. This ensures that, after a VM exit, the host's IDT is used for interrupt management, allowing the hypervisor to handle interrupts and exceptions in the host environment rather than through the guest's IDT.
When setting up a virtual machine, the hypervisor must properly save and load the IDT for both the guest and the host. The guest's IDT is used when the VM is running, and the host's IDT is used when the CPU exits from the VM and resumes execution in the host context. By storing the IDT details in the VMCS, the hypervisor can seamlessly manage interrupt handling as the system transitions between guest and host modes. This functionality is critical for maintaining control over the virtual machine and ensuring the integrity of the interrupt handling system during virtualization.
The IDT's role in virtualization is not just about managing interrupts but also ensuring that the correct interrupt handling mechanism is in place, depending on whether the system is running in guest or host mode. It helps prevent a guest VM from interfering with interrupt handling in the host environment, ensuring that a VM's interrupt requests are controlled and isolated.
Control Registers
Control registers (CR0, CR3, CR4) manage various system operations, including memory management, protection mechanisms, and turning specific CPU features on or off (as we saw in the first blog post). These registers play a significant role in the CPU's overall execution and are essential in both protected mode and virtualized environments.
The CR0 register governs fundamental processor behavior. It turns essential features like paging and protection mechanisms on or off. For instance, when the Paging (PG) bit is set in CR0, paging is enabled, which allows the processor to translate virtual addresses to physical ones. The Protection Enable (PE) bit in CR0 enables protected mode, a feature crucial for preventing programs from directly accessing hardware resources.
The CR3 register is used primarily for memory management. It holds the physical base address of the top-level paging structure (the page directory in 32-bit paging, the PML4 table in 4-level paging), which the CPU walks to translate virtual addresses to physical addresses during memory access.
The CR4 register manages additional CPU features. For example, the PAE (Physical Address Extension) bit in CR4 allows the CPU to access more than 4 GB of physical memory in 32-bit mode, and the VMXE bit enables virtual machine extensions, which are crucial for virtualization.
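The bits mentioned above sit at fixed architectural positions, so they are easy to express as constants. The following sketch shows a few helpers built on them; the function names are hypothetical, and the CR3 mask assumes 4-level paging, where bits 51:12 hold the table address.

```rust
/// CR0.PE (bit 0): protected mode enable.
pub const CR0_PE: u64 = 1 << 0;
/// CR0.PG (bit 31): paging enable.
pub const CR0_PG: u64 = 1 << 31;
/// CR4.PAE (bit 5): physical address extension.
pub const CR4_PAE: u64 = 1 << 5;
/// CR4.VMXE (bit 13): virtual machine extensions enable.
pub const CR4_VMXE: u64 = 1 << 13;

/// Paging requires protected mode, so both PE and PG must be set.
pub fn paging_enabled(cr0: u64) -> bool {
    cr0 & (CR0_PE | CR0_PG) == (CR0_PE | CR0_PG)
}

/// Extracts the physical address of the top-level paging structure
/// from a CR3 value, dropping the low flag bits (bits 11:0).
pub fn top_level_table_address(cr3: u64) -> u64 {
    cr3 & 0x000F_FFFF_FFFF_F000
}

/// Returns the CR4 value with VMXE set, leaving other bits untouched.
pub fn with_vmx_enabled(cr4: u64) -> u64 {
    cr4 | CR4_VMXE
}
```

Reading or writing the real registers needs privileged instructions and inline assembly, which is why these helpers only operate on plain values.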
In the virtualization context, the VMCS stores the values of these control registers for both the host and the guest. On a VM exit, the CPU loads CR0, CR3, and CR4 from the host-state area, so the hypervisor must have written correct values there beforehand for the host's environment to be restored correctly. Likewise, the guest-state area holds the guest's control register values, which the CPU loads on VM entry and saves on VM exit. This allows the hypervisor to maintain the guest's state while preserving its own environment.
Conclusion
The VMCS serves as the backbone of virtualization, maintaining the CPU context for both the host and guest during transitions between them. At its core, the VMCS holds the CPU state, including control registers, segment selectors, the GDTR and IDTR values, and other essential values, so that the right context is loaded on every switch between guest and host execution. This separation between host and guest states enables effective memory separation and protection, so that guests can run without interfering with the underlying host environment or with each other.
This memory separation is crucial for ensuring the guests' and host's integrity and security. When the CPU transitions into guest mode, it loads the guest's context, including its control registers, segment selectors, and any other necessary values, ensuring that the guest operates in its isolated memory space. Conversely, when transitioning back to host mode, the CPU loads the host's context from the host-state area, restoring the host environment as the hypervisor described it before the VM entry. This isolation mechanism guarantees that the guest's execution cannot directly modify or interfere with the host's resources, thus creating a secure boundary between the two environments.
However, despite the simplicity of the theory behind the VMCS, the practical implementation is far more involved. Significant setup is required to configure all the mandatory VMCS fields, even in a minimal hypervisor. For instance, the hypervisor must describe the GDT (Global Descriptor Table) and IDT (Interrupt Descriptor Table) for both the host and guest, as well as segment selectors, control registers (CR0, CR3, CR4), and other crucial information. These fields exist separately for the host and the guest to ensure that the CPU's state is properly isolated for each during VM transitions.
Additionally, both the host-state and guest-state areas include the instruction pointer (RIP) and the stack pointer (RSP). The instruction pointer determines where the CPU begins execution when transitioning into the guest or back to the host, while the stack pointer ensures the correct stack is used after the switch.
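To give a feel for the amount of bookkeeping involved, here is a minimal sketch of writing part of the host state. Everything except the field encodings is an assumption made for illustration: the `VmcsWriter` trait stands in for whatever wraps the VMWRITE instruction, `RecordingVmcs` is a mock that records writes in a map (using the standard library, which a real kernel module would not have), and `HostState` is a hypothetical container. A real VMCS needs many more fields than shown here.

```rust
use std::collections::BTreeMap;

// Host-state field encodings from the Intel SDM.
pub const HOST_CR0: u32 = 0x6C00;
pub const HOST_CR3: u32 = 0x6C02;
pub const HOST_CR4: u32 = 0x6C04;
pub const HOST_GDTR_BASE: u32 = 0x6C0C;
pub const HOST_IDTR_BASE: u32 = 0x6C0E;
pub const HOST_RSP: u32 = 0x6C14;
pub const HOST_RIP: u32 = 0x6C16;

/// Hypothetical abstraction over the VMWRITE instruction.
pub trait VmcsWriter {
    fn vmwrite(&mut self, field: u32, value: u64);
}

/// Mock VMCS that records writes, useful for testing outside the kernel.
#[derive(Default)]
pub struct RecordingVmcs {
    pub fields: BTreeMap<u32, u64>,
}

impl VmcsWriter for RecordingVmcs {
    fn vmwrite(&mut self, field: u32, value: u64) {
        self.fields.insert(field, value);
    }
}

/// Subset of the host context loaded by the CPU on a VM exit.
pub struct HostState {
    pub cr0: u64,
    pub cr3: u64,
    pub cr4: u64,
    pub gdtr_base: u64,
    pub idtr_base: u64,
    /// Stack the VM-exit handler runs on.
    pub rsp: u64,
    /// Address of the VM-exit handler.
    pub rip: u64,
}

/// Writes each host value into its VMCS field.
pub fn write_host_state(vmcs: &mut impl VmcsWriter, host: &HostState) {
    vmcs.vmwrite(HOST_CR0, host.cr0);
    vmcs.vmwrite(HOST_CR3, host.cr3);
    vmcs.vmwrite(HOST_CR4, host.cr4);
    vmcs.vmwrite(HOST_GDTR_BASE, host.gdtr_base);
    vmcs.vmwrite(HOST_IDTR_BASE, host.idtr_base);
    vmcs.vmwrite(HOST_RSP, host.rsp);
    vmcs.vmwrite(HOST_RIP, host.rip);
}
```

Hiding VMWRITE behind a trait is a design choice for this sketch: it lets the field-filling logic be checked with a mock before it ever runs on real hardware.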
Properly configuring these fields is essential for maintaining the correct execution flow after a VM entry or exit. A misconfigured or forgotten field can be challenging to debug: it may cause a crash or a complete system freeze, with no opportunity to collect useful information about the error.
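Not every mistake ends in a freeze, though. VMX instructions report some failures through RFLAGS: the carry flag signals a failure without a usable current VMCS, and the zero flag signals a failure whose error number can be read from the VM-instruction error field. The sketch below classifies an RFLAGS value captured right after such an instruction. The enum and function names are hypothetical, and capturing RFLAGS itself would need a little assembly that is not shown.

```rust
/// RFLAGS.CF (bit 0).
pub const RFLAGS_CF: u64 = 1 << 0;
/// RFLAGS.ZF (bit 6).
pub const RFLAGS_ZF: u64 = 1 << 6;
/// Encoding of the VM-instruction error field.
pub const VM_INSTRUCTION_ERROR: u32 = 0x4400;

/// Outcome of a VMX instruction, derived from RFLAGS.
#[derive(Debug, PartialEq, Eq)]
pub enum VmxOutcome {
    /// Neither CF nor ZF is set.
    Success,
    /// CF is set: no error number is available.
    FailInvalid,
    /// ZF is set: read VM_INSTRUCTION_ERROR for the error number.
    FailValid,
}

/// Classifies the RFLAGS value observed right after a VMX instruction.
pub fn classify(rflags: u64) -> VmxOutcome {
    if rflags & RFLAGS_CF != 0 {
        VmxOutcome::FailInvalid
    } else if rflags & RFLAGS_ZF != 0 {
        VmxOutcome::FailValid
    } else {
        VmxOutcome::Success
    }
}
```

Checking this after every VMX instruction catches rejected fields early, although invalid host state that passes these checks can still bring the machine down on the next VM exit.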
In summary, while the theoretical understanding of the VMCS and its role in memory separation and CPU context management is straightforward, the actual implementation can be tedious. With this foundational knowledge, however, we can now implement a minimal hypervisor in Rust, using just a small amount of assembly code where necessary.