MIT 6.033 CSE Operating System
Course Note for Operating System of MIT 6.033 Computer System Engineering covers common *design patterns* to limit *complexity* and techniques, like *virtualization* and *abstraction*, to enforce *modularity*.
MIT 6.033 (Computer System Engineering) covers 4 parts: Operating Systems, Networking, Distributed Systems, and Security.
This is the course note for Part I: Operating Systems. And in this section, we mainly focus on:
- How common design patterns in computer system — such as abstraction and modularity — are used to limit complexity.
- How operating systems use virtualization and abstraction to enforce modularity.
In this section, we talk about what is complexity in computer systems, and how to mitigate it.
A system is a set of interconnected components that has an expected behavior observed at the interface with its environment.
So we say that a system has complexity, which limits what we can build. However, complexity can be mitigated with design patterns, such as modularity and abstration.
Nowadays, we usually enforce modularity by Client/Server Model, or C/S Model, where two modules reside on different machines and communicate with RPCs.
In this section, we talk about naming, which allows modules to communicate.
Naming is that a name can be resolved to the entity it refers to. Therefore, it allows modules to interact, and can help to achieve goals such as indirection, user-friendliness, etc.
The design of a naming scheme has 3 parts: name, value, and look-up algorithm.
One great case of naming scheme is Domain Name System (DNS), which illustrates principles such as hierarchy, scalability, delegation and decentralization. Especially, the hierarchical design of DNS let us scale up to the Internet.
Virtual Memory is a primary technique that uses Memory Management Unit (MMU) to translate virtual address into physical address by using page tables.
To enforce modularity, the operating system(OS) kernel checks the following 3 bits:
| Name | Description |
|---|---|
| User/Supervisor (U/S) bit | if the program allowed to access the address |
| Present (P) bit | if the page currently in memory |
| User/Kernel (U/K) bit | whether the operation is in user mode or kernel mode |
These 3 bits let the OS know when to trigger page faults, and if the access triggers an exception, the OS kernel will first switch to kernel mode and then execute the corresponding exception handler before switching back to user mode.
To deal with performance issues, the Operating Systems introduce two mechanisms: hierarchical page table and cache. The hierarchical(multilevel) page table reduces the memory overhead associated with the page table, at the expense of more table look-ups. And cache, also known as Translation Lookaside Buffer (TLB), stores recent translations of virtual memory to physical addresses to enable faster retrieval.
OS enforces modularity by virtualization and abstraction. On resources that can be virtualized, such as memory, OS uses virtualization. And for those components that are difficult to virtualize such as disk and network, OS presents abstration.
Let’s virtualize communication links - the bouded buffers.
But first, we need Lock, which is a protecting mechanism that allows only one CPU to execute a piece of code at a time to implement atomic actions. If two CPUs try to acquire the same lock at the same time, only one of them will succeed and the other will block until the first CPU releases the lock.
Implementing locks is possible by the support of a special hardware called controller that manages access to memory.
acquire(lock): do: r = 1 XCHG r, lock while r == 1
release(lock): lock = 0A bounded buffer is a buffer that has (up to) N slots and allows concurrent programs to send/receive messages.
A bounded buffer with lock may deal with race condition, therefore, we need to decide where to put locks:
- coarse-grained locking is easy to maintain correctness, but it will lead to bad performance;
- fine-grained locking improves performance, but it may cause inconsistent state;
- multiple locking requires that locks are acquired in the same order, otherwise the dead lock may happen.
In addition, bounded buffer with lock is yet another example of virtualization, which means any of senders/receivers think it has full access to the whole buffer.
Let’s virtualize processors - the threads.
Thread is a virtual processor and has 3 states:
- RUNNING (actively running)
- RUNNABLE (ready to go, but not running)
- WAITING (waiting for a particular event)
To change the states of a thread, we often use 2 APIs:
suspend(): save state of current thread to memory.resume(): restore state from memory.
In reality, most threads spend most of the time waiting for events to occur. So we use yield() to let the current thread voluntarily suspend itself, and then let the kernel choose a new thread to resume execution.
In particular, we maintain a processor table and a thread table.
- The processor table (
cpus) keeps track of which processor is currently running which thread; - The thread table (
threads) keeps track of thread states.
yield_(): acquire(t_lock) # 1. Suspend the running thread id = cpus[CPU].thread # thread #id is on #CPU threads[id].state = RUNNABLE threads[id].sp = SP # stack pointer threads[id].ptr = PTR # page table register
# 2. Choose the new thread to run do: id = (id + 1) mod N while threads[id].state != RUNNABLE
# 3. Resume the new thread SP = threads[id].sp PTR = threads[id].ptr threads[id].state = RUNNING cpus[CPU].thread = id
release(t_lock)
# send a `message` into `bb`(N-slot buffer)send(bb, message ): acquire(bb.lock) # when the buffer is full while bb.in_num - bb.out_num >= N: release(bb.lock) yield_() acquire(bb.lock) bb.buf[bb.in_num % N] <- message bb.in_num += 1 release(bb.lock)
# reveive a message from bbreceive(bb): acquire(bb.lock) # while the buffer is empty while bb.out_num >= bb.in_num: release(bb.lock) yield_() acquire(bb.lock) message <- bb.buf[bb.out_num % N] bb.out_num += 1 release(bb.lock) return messageHowever, the sender may get resumed in the meantime, even if there is no room in buffer. One solution to fix that is to use condition variables
Condition variable is simply a synchronization primitive that allow kernel to notify threads instead of having threads constantly make checks. And it has 2 APIs:
wait(cv): yield processor and wait to be notified ofcv, a condition variable.notify(cv): notify threads that are waiting forcv.
However, condition variables without lock may cause “Lost notify” problem:
# send a `message` into `bb`(N-slot buffer)send(bb, message ): acquire(bb.lock) # while the buffer is full while bb.in_num - bb.out_num >= N: release(bb.lock) wait(bb.has_space) ### ! acquire(bb.lock) bb.buf[bb.in_num % N] <- message bb.in_num += 1 release(bb.lock) notify(bb.has_message) return
# reveive a message from bbreceive(bb): acquire(bb.lock) # while the buffer is empty while bb.out_num >= bb.in_num: release(bb.lock) wait(bb.has_message) acquire(bb.lock) message <- bb.buf[bb.out_num % N] bb.out_num += 1 release(bb.lock) notify(bb.has_space) ### ! return messageConsidering there are two threads: T1(sender), and T2(receiver).
- T1 acquires
bb.lockon buffer, finding it full, so T1 releasesbb.lock - Prior to T1 calling
wait(bb.has_space), T2 just acquiresbb.lockto read messages, notifying the T1 that the buffer now has space(s). - but T1 is actually not yet waiting for
bb.has_space(Bacause T1 was interrupted by OS before it could callwait(bb.has_space)).
So, as you can see, it cause the “lost notify” problem. And the solution to fix that is use a lock.
wait(cv, lock): yield processor, release lock, wait to be notified ofcvnotify(cv): notify waiting threads ofcv
yield_wait(): id = cpus[CPU].thread threads[id].sp = SP threads[id].ptr = PTR SP = cpus[CPU].stack # avoid stack corruption
do: id = (id + 1) mod N release(t_lock) # ! acquire(t_lock) # ! while threads[id].state != RUNNABLE
SP = threads[id].sp PTR = threads[id].ptr threads[id].state = RUNNING cpus[CPU].thread = id
wait(cv, lock): acquire(t_lock) release(lock) # let others access what `lock` protects # mark the current thread: wait for `cv` id = cpus[CPU].thread threads[id].cv = cv threads[id].state = WAITING
# different from `yield_()` mentioned above! yield_wait()
release(t_lock) acquire(lock) # disallow others to access what `lock` protects
notify(cv): acquire(t_lock) # Find all threads waiting for `cv`, # and change states: WAITING -> RUNNABLE for id = 0 to N-1: if threads[id].cv == cv && threads[id].state == WAITING: threads[id].state = RUNNABLE release(t_lock)
# send `message` into N-slot buffer `bb`send(bb, message): acquire(bb.lock) while bb.in_num - bb.out_num >= N: wait(bb.has_space, bb.lock) bb.buf[bb.in_num % N] <- message bb.in_num += 1 release(bb.lock) notify(bb.has_message) return
# reveive a message from bbreceive(bb): acquire(bb.lock) # while the buffer is empty while bb.out_num >= bb.in_num: wait(bb.has_message, bb.lock) message <- bb.buf[bb.out_num % N] bb.out_num += 1 release(bb.lock) notify(bb.has_space) return messageNote:
- Why
yield_wait(), rather thanyield_()? Becauseyield_()will cause Deadlock. At the beginning ofwait(cv, lock), we acquire and holdt_lock. So if we invokeyield_(), it will try to acquiret_lockagain, causing deadlock problem. - Why
yield_wait()releases and then immediately acquirest_lock? Because it guarantee other threads can access the buffer. Considering there are 5 senders writing into buffer and only 1 receiver reading the buffer. If all 5 senders find the buffer full, it is important to releaset_lockto let the only 1 receiver acquire thet_lockand read the buffer. - Why do we need to
SP = cpus[CPU].stack? To avoid stack corruption when this thread is scheduled to a different CPU.
And the new problem arises, what if the thread never yield CPU? Use preemption.
Preemption forcibly interrupts a thread so that we don’t have to rely on programmers correctly using yield(). In this case, if a thread never calls yield() or wait(), it’s okay; special hardware will periodically generate an interrupt and forcibly call yield().
But what if this interrupt occurs while running yield() or yield_wait(): Deadlock. And the solution is to require hardware mechanism to disable interrupts.
The kernel is a non-interruptible, trusted program that runs system code.
Kernel errors are fatal, so we try to limit the size of kernel code. There are two models for kernels.
- The monolithic kernel implements most of the OS in the kernel, and everything sharing
- The microkernel implements different features as client-servers. They enforce modularity by putting subsystems in user programs.
Virtual Machine (VM) allows us to run multiple isolated operating systems on a single physical machine. VMs must handle the challenges of virtualizing the hardware.
The Virtual Machine Monitor (VMM) deals with privileged instructions, allocates resources, and dispatches events.
The guest OS runs in user mode. Privileged instructions throw exceptions, and VMM will trap and emulate. In modern hardware, the physical hardware knows of both page tables, and it directly translates from guest virtual address to host physical address.
However, there are still some cases in which we cannot trap exceptions. There are several solutions:
- Para-virtualization is where the guest OS changes a bit, which defeats the purpose of a VM
- Binary translation is also a method (VMWare used to use this), but it is slightly messy
- Hardware support for virtualization means that hardware has VMM capabilities built-in. The guest OS can directly manipulate page tables, etc. Most VMMs today have hardware support.
There are 3 metrics to measure performance:
- latency: how long does it take to complete a single task?
- Throughput: the rate of useful work, or how many requests per unit of time.
- Utilization: what fraction of resources are being utilized