1. A brief history of operating systems
- Single-application computers: one CPU, one program.
- Timesharing: one CPU, one program at a time.
- Virtual machines: one CPU, multiple programs (that all think they are running alone).
- Processes: abstract CPU, multiple programs (that all think they are running at the same time).
- Threads: two layers of processes, with lightweight processes running inside heavyweight processes.
- Virtualization: virtual machines running complete operating systems inside processes.
The last step can be seen as a return to the virtual machine days of yesteryear—possibly even as a precursor to going back to one CPU per program. Or it can be seen as an attempt to build fully recursive processes.
2. Why virtualize?
- Simulate hardware you don't own (e.g. SoftPC/SoftWindows/Virtual PC for Macs in the 1980s and 1990s).
- Simulate hardware you don't own any more.
- Share resources with full isolation (e.g. rented web servers).
- Run programs that expect incompatible OS environments (e.g. using VMware to run Office under Linux).
- Run programs that you don't trust with access to the underlying hardware.
3. Virtualization techniques
The goal is to run a guest operating system on top of a host operating system so that the guest OS thinks it is running on bare hardware. There are basically two ways to do this: using an emulator or using a hypervisor.
3.1. Emulation
Emulation is the simplest approach conceptually. We write a program (in C, say) that simulates all of the underlying physical hardware, including the CPU. CPU registers, the MMU, virtual memory, etc. are all represented using data structures in the program, and instruction execution involves a dispatch loop that calls the appropriate procedure to update these data structures for each instruction.
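The dispatch loop described above can be sketched in a few lines. This is only a toy: the four opcodes, the 256-byte memory, and the names ToyCPU, DISPATCH, run, and demo are all invented for illustration and do not correspond to any real machine or emulator.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCPU:
    """All 'hardware' state lives in ordinary data structures."""
    pc: int = 0
    acc: int = 0
    halted: bool = False
    memory: list = field(default_factory=lambda: [0] * 256)

# Hypothetical one-byte opcodes, each followed by a one-byte operand.
LOAD_IMM, ADD_MEM, STORE, HALT = 0x01, 0x02, 0x03, 0xFF

def op_load_imm(cpu, operand):
    cpu.acc = operand

def op_add_mem(cpu, operand):
    cpu.acc = (cpu.acc + cpu.memory[operand]) & 0xFF  # 8-bit wraparound

def op_store(cpu, operand):
    cpu.memory[operand] = cpu.acc

def op_halt(cpu, operand):
    cpu.halted = True

# One handler procedure per opcode.
DISPATCH = {
    LOAD_IMM: op_load_imm,
    ADD_MEM: op_add_mem,
    STORE: op_store,
    HALT: op_halt,
}

def run(cpu, max_steps=10_000):
    """Fetch, decode, and dispatch until HALT or the step budget runs out."""
    steps = 0
    while not cpu.halted and steps < max_steps:
        opcode = cpu.memory[cpu.pc]
        operand = cpu.memory[cpu.pc + 1]
        cpu.pc += 2
        DISPATCH[opcode](cpu, operand)
        steps += 1
    return steps

def demo():
    """Compute 5 + memory[100] and store the result in memory[101]."""
    cpu = ToyCPU()
    program = [LOAD_IMM, 5, ADD_MEM, 100, STORE, 101, HALT, 0]
    cpu.memory[:len(program)] = program
    cpu.memory[100] = 7
    run(cpu)
    return cpu.memory[101]
```

Every guest instruction costs a dictionary lookup and a procedure call on the host, on top of the work of the instruction itself, which is where the slowdown comes from.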
Examples: Bochs, SoftPC, many emulators for defunct hardware like Apple IIs or old video game consoles.
Advantages: Runs anywhere, requires no support from host OS.
Disadvantage: Horrendously slow. When emulating old hardware, this is not necessarily a problem: A 2 GHz Pentium doing up to 4 instructions per clock cycle can do a pretty good job of faking a 1 MHz 6502 doing 2-3 clock cycles per instruction. But it's less convincing when emulating recent hardware.
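The figures in this comparison can be turned into a rough instruction budget. The sketch below uses only the peak numbers quoted above (2 GHz, 4 instructions per cycle, 1 MHz, 2-3 cycles per instruction); the function name is made up, and real sustained throughput will be lower than the peak.

```python
HOST_CLOCK_HZ = 2e9        # 2 GHz host
HOST_PEAK_IPC = 4          # up to 4 instructions per clock cycle
GUEST_CLOCK_HZ = 1e6       # 1 MHz emulated CPU

def host_instructions_per_guest_instruction(guest_cycles_per_instruction):
    """Peak host instructions available to emulate one guest instruction in real time."""
    host_ips = HOST_CLOCK_HZ * HOST_PEAK_IPC
    guest_ips = GUEST_CLOCK_HZ / guest_cycles_per_instruction
    return host_ips / guest_ips

for cycles in (2, 3):
    budget = host_instructions_per_guest_instruction(cycles)
    print(f"{cycles} cycles/instruction: about {round(budget)} host instructions each")
```

At peak rates the host has roughly 16,000 to 24,000 instructions to spend on each emulated instruction, which leaves plenty of room for a dispatch loop. When guest and host run at comparable speeds, that budget shrinks to a handful of instructions or less.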
3.2. Hypervisors
A hypervisor or virtual machine monitor runs the guest OS directly on the CPU. (This only works if the guest OS uses the same instruction set as the host OS.) Since the guest OS is running in user mode, privileged instructions must be intercepted or replaced. This further imposes restrictions on the instruction set for the CPU, as observed in a now-famous paper by Popek and Goldberg published in CACM in 1974, "Formal requirements for virtualizable third generation architectures" (see http://portal.acm.org/citation.cfm?doid=361011.361073).
Popek and Goldberg identify three goals for a virtual machine architecture:
- Equivalence: The VM should be indistinguishable from the underlying hardware.
- Resource control: The hypervisor should be in complete control of any virtualized resources.
- Efficiency: Most VM instructions should be executed directly on the underlying CPU without involving the hypervisor.
They then describe (and give a formal proof of) the requirements for the CPU's instruction set to allow these properties. The main idea here is to classify instructions into privileged instructions, which cause a trap if executed in user mode, and sensitive instructions, which change the underlying resources (e.g. doing I/O or changing the page tables) or observe information that indicates the current privilege level (thus exposing the fact that the guest OS is not running on the bare hardware). The former class of sensitive instructions is called control sensitive and the latter behavior sensitive in the paper, but the distinction is not particularly important.
What Popek and Goldberg show is that we can build a virtual machine with all three desired properties if the sensitive instructions are a subset of the privileged instructions. If this is the case, then we can run most instructions directly, and any sensitive instructions trap to the hypervisor, which can then emulate them (hopefully without much slowdown).
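A minimal sketch of this trap-and-emulate arrangement follows, assuming a toy instruction set in which every sensitive instruction is also privileged. The opcode names, the PrivilegeTrap exception (standing in for a hardware trap), and both classes are hypothetical.

```python
class PrivilegeTrap(Exception):
    """Stands in for the hardware trap raised by a privileged instruction in user mode."""
    def __init__(self, instr):
        super().__init__(instr)
        self.instr = instr

# Hypothetical opcodes; in this toy machine all sensitive instructions are privileged.
PRIVILEGED = {"set_page_table", "disable_interrupts"}

class UserModeCPU:
    """Stands in for the real CPU running guest code in user mode."""
    def __init__(self):
        self.acc = 0

    def execute(self, instr):
        op, arg = instr
        if op in PRIVILEGED:
            raise PrivilegeTrap(instr)   # real hardware would trap here
        if op == "add":
            self.acc += arg              # innocuous: runs directly
        else:
            raise ValueError(f"unknown instruction {op!r}")

class Hypervisor:
    def __init__(self):
        # The guest's view of the privileged state, kept in software.
        self.virtual_state = {"page_table": None, "interrupts_enabled": True}
        self.traps = 0

    def emulate(self, instr):
        """Apply the trapped instruction to the guest's virtual state only."""
        op, arg = instr
        self.traps += 1
        if op == "set_page_table":
            self.virtual_state["page_table"] = arg
        elif op == "disable_interrupts":
            self.virtual_state["interrupts_enabled"] = False

    def run_guest(self, cpu, program):
        for instr in program:
            try:
                cpu.execute(instr)
            except PrivilegeTrap as trap:
                self.emulate(trap.instr)
```

Only the two privileged instructions pay the cost of a trap; the add instructions never involve the hypervisor, which is the efficiency property in miniature.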
The bad news: Most CPU architectures contain sensitive but unprivileged instructions, known as critical instructions. For example, the IA32 architecture allows unprivileged programs to read the registers that locate the global, local, and interrupt descriptor tables (using SGDT, SLDT, and SIDT), so if the hypervisor is lying about where the interrupt vectors live, the guest OS can find this out. (A more complete list of bad instructions on IA32 can be found at http://www.floobydust.com/virtualization/lawton_1999.txt.) So some mechanism is needed to trap these instructions.
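The classification boils down to a subset test, and the critical instructions are just a set difference. The sets below are a small illustrative sample of IA32 mnemonics, not a complete classification; the function names are invented for this sketch.

```python
# Partial, illustrative samples only; see the list linked above for more.
PRIVILEGED = {"LGDT", "LIDT", "LLDT", "LMSW"}
SENSITIVE = PRIVILEGED | {"SGDT", "SIDT", "SLDT", "SMSW", "PUSHF", "POPF"}

def critical_instructions(sensitive, privileged):
    """Sensitive instructions that do not trap in user mode."""
    return sensitive - privileged

def satisfies_popek_goldberg(sensitive, privileged):
    """True if every sensitive instruction is also privileged."""
    return sensitive <= privileged

print(sorted(critical_instructions(SENSITIVE, PRIVILEGED)))
print(satisfies_popek_goldberg(SENSITIVE, PRIVILEGED))
```

A non-empty result from critical_instructions means plain trap-and-emulate is not enough, and one of the techniques in the following subsections is needed.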
3.2.1. Using breakpoints
One approach is to use the CPU's breakpoint mechanism to trap on critical instructions. This requires scanning the code to be executed so we know where to put the breakpoints. The tricky part is that typically we don't have enough breakpoints to cover all critical instructions, so in practice we can only execute code natively in chunks, where we trap anything that escapes from the chunk we have covered (this is not as hard as it sounds, since we can use the virtual memory system to mark any page outside the current one as non-executable). This requires that when we switch to a new chunk we rescan it, adding quite a bit of overhead to executing straight-line code.
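The scanning step can be sketched as follows. This is a simplification that treats code as a straight-line list of mnemonics and lets a chunk end wherever the breakpoints run out, rather than at a page boundary; it also ignores branches. The limit of four matches the four debug address registers on x86, and the CRITICAL set and function names are illustrative assumptions.

```python
CRITICAL = {"SGDT", "SIDT", "POPF"}   # illustrative subset
MAX_BREAKPOINTS = 4                   # x86 provides four debug address registers

def plan_chunk(code, start, max_breakpoints=MAX_BREAKPOINTS):
    """Scan forward from start.

    Returns (breakpoints, end): the positions needing a breakpoint and the
    exclusive end of the chunk that can run natively with them in place.
    """
    breakpoints = []
    pos = start
    while pos < len(code):
        if code[pos] in CRITICAL:
            if len(breakpoints) == max_breakpoints:
                break                 # out of breakpoints: chunk ends here
            breakpoints.append(pos)
        pos += 1
    return breakpoints, pos

def plan_all_chunks(code):
    """Split the whole code sequence into (start, end, breakpoints) chunks."""
    chunks = []
    start = 0
    while start < len(code):
        breakpoints, end = plan_chunk(code, start)
        chunks.append((start, end, breakpoints))
        start = end
    return chunks
```

Each chunk boundary corresponds to a rescan, so code that is dense in critical instructions produces many small chunks and correspondingly more overhead.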
3.2.2. Using code rewriting
A more efficient method is to rewrite the code itself. If we replace every occurrence of a critical instruction with a system call, we can emulate the critical instruction directly without any sneakiness. We can do the same for all privileged instructions as well, which may slightly increase performance over just using protection faults. The problem is that now the guest OS may notice that its code isn't what it thought it should be.
Fortunately, the virtual memory system again comes to our rescue: by marking each rewritten page as executable but not readable, any attempt by the guest OS to read a page can be trapped. We can then supply the data from the original page (which we presumably kept around somewhere).
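A sketch of the bookkeeping involved, assuming a page is just a sequence of mnemonics: the hypervisor keeps both the rewritten copy (used for instruction fetch) and the original (supplied when a trapped read occurs). The RewrittenPage class, the HYPERCALL marker, and the CRITICAL set are all hypothetical.

```python
CRITICAL = {"SGDT", "SIDT", "POPF"}   # illustrative subset

def rewrite(instr):
    """Replace a critical instruction with an explicit call into the hypervisor."""
    if instr in CRITICAL:
        return ("HYPERCALL", instr)
    return instr

class RewrittenPage:
    """One guest code page with separate execute and read views."""
    def __init__(self, original):
        self.original = tuple(original)                        # kept for guest reads
        self.rewritten = tuple(rewrite(i) for i in original)   # what actually runs

    def fetch(self, offset):
        """Instruction fetch: the page is executable, so this sees rewritten code."""
        return self.rewritten[offset]

    def read(self, offset):
        """Data read: the page is not readable, so the access traps and the
        hypervisor answers from the saved original."""
        return self.original[offset]

page = RewrittenPage(["MOV", "SGDT", "ADD"])
print(page.fetch(1))   # the rewritten instruction
print(page.read(1))    # what the guest sees if it inspects its own code
```

The guest executes the rewritten instruction but can only ever observe the original, so a checksum over its own code still comes out as expected.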
3.2.3. Using paravirtualization
A third approach is to let the guest OS do its own code rewriting. Here we use a modified guest OS that replaces privileged instructions with explicit hypervisor calls. We still need to detect and trap any sensitive instructions, but the cost of doing so is likely to be small (if we are lazy, we can simply ignore the issue of critical instructions, since we have already given the game away by asking for a modified guest OS). This is the best approach if we can do it, but since it depends on modifying the guest OS, it doesn't work in general.
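In code, the difference from the previous approaches is that the call into the hypervisor appears in the guest's own source. The sketch below assumes a hypothetical hypercall interface with a single set_page_table operation; none of these names come from a real hypervisor.

```python
class Hypervisor:
    """Toy hypervisor exposing explicit hypercalls to modified guests."""
    def __init__(self):
        self.page_tables = {}   # guest id -> page table base chosen by that guest
        self.calls = []

    def hypercall(self, guest_id, name, *args):
        handler = getattr(self, "_hc_" + name, None)
        if handler is None:
            raise ValueError(f"unknown hypercall {name!r}")
        self.calls.append((guest_id, name))
        return handler(guest_id, *args)

    def _hc_set_page_table(self, guest_id, base):
        # A real hypervisor would validate the request before touching hardware.
        self.page_tables[guest_id] = base
        return 0

class ParavirtGuestKernel:
    """Guest kernel whose source has been modified to call the hypervisor."""
    def __init__(self, hypervisor, guest_id):
        self.hypervisor = hypervisor
        self.guest_id = guest_id

    def switch_address_space(self, base):
        # An unmodified kernel would execute a privileged instruction here.
        return self.hypervisor.hypercall(self.guest_id, "set_page_table", base)
```

Because the guest asks for what it wants directly, the hypervisor does not need to decode or emulate an instruction to work out the guest's intent.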
3.2.4. Using additional CPU support
The ultimate solution is to fix the CPU so that there are no critical instructions. We can't change the instruction set if we want to run old programs unmodified, so instead we have to expand the CPU to move control registers into virtual machines implemented in hardware (again, back to the past). This is done in recent Intel and AMD CPUs; a description of the Intel approach can be found at http://www.intel.com/technology/itj/2006/v10i3/1-hardware/5-architecture.htm. Such support allows for ring aliasing, where unprivileged code thinks it is running with higher privileges, and allows for executing many privileged instructions that control CPU state without faulting (because they now operate on the fake state in the virtual machines, which can be modified without causing trouble). Quite a bit of work is still needed to translate operations on the fake machine to operations on the underlying machine; for operations that actually affect the system (I/O, changes to virtual memory), the CPU must trap to the hypervisor running on real hardware.
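The division of labor described here can be modeled roughly as follows. This is a conceptual sketch only: the operation names, the GuestControlState fields, and the choice of which operations exit are assumptions made for illustration, not the actual Intel or AMD interfaces.

```python
from dataclasses import dataclass

@dataclass
class GuestControlState:
    """Per-guest copy of the control state, held by the (modeled) hardware."""
    interrupts_enabled: bool = True
    privilege_ring: int = 0    # the guest believes it runs in ring 0

# Operations that affect the real system and so must reach the hypervisor.
EXITING_OPS = {"io_out", "set_page_table"}

class VirtualizingCPU:
    def __init__(self, on_exit):
        self.state = GuestControlState()
        self.on_exit = on_exit   # hypervisor callback invoked on an exit
        self.exits = 0

    def execute(self, op, arg=None):
        if op in EXITING_OPS:
            self.exits += 1
            return self.on_exit(op, arg)           # trap to the hypervisor
        # Everything below touches only the guest's private state: no trap.
        if op == "disable_interrupts":
            self.state.interrupts_enabled = False
        elif op == "enable_interrupts":
            self.state.interrupts_enabled = True
        elif op == "read_ring":
            return self.state.privilege_ring
        else:
            raise ValueError(f"unknown operation {op!r}")
        return None
```

In this model a guest that reads its privilege level sees ring 0 and toggles interrupts freely, and the hypervisor is only involved when the guest performs I/O or changes its memory mappings.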
4. Applications
We've already mentioned some of the main applications. Broadly speaking, there are three main reasons to use a virtual machine:
- Emulating hardware or operating systems that would otherwise not be available.
- Timesharing with full OS isolation.
- Isolating programs that you don't trust.
Timesharing mostly comes up with systems that expect to have full control of the machine. For example, web servers and database servers typically expect to be the only one running at a time. So if you want to rent out webserver space, it makes sense to split your single real server among multiple virtual machines that can be configured to the tastes of your various clients. This also provides isolation, always a good thing.
Isolation can also be an issue for programs that you don't trust. If you worry that your webserver can be compromised, running it inside a virtual machine prevents it from escaping and compromising the rest of your system; a VM thus acts as a perfect jail. Conversely, bad guys can use virtualization to produce near-perfect rootkits: having a compromised machine appear indistinguishable from an uncompromised machine is the definition of successful virtualization. Such techniques may also be used to subvert software-only DRM mechanisms.