MIT 6.033 CSE: Operating Systems
MIT 6.033 (Computer System Engineering) covers 4 parts: Operating Systems, Networking, Distributed Systems, and Security.
These are the course notes for Part I: Operating Systems. In this part, we mainly focus on:
- How common design patterns in computer systems, such as abstraction and modularity, are used to limit complexity.
- How operating systems use virtualization and abstraction to enforce modularity.
Complexity
In this section, we discuss what complexity is in computer systems and how to mitigate it.
A system is a set of interconnected components that has an expected behavior observed at the interface with its environment.
Every system has complexity, which limits what we can build. However, complexity can be mitigated with design patterns such as modularity and abstraction.
Nowadays, we usually enforce modularity with the client/server (C/S) model, where modules reside on different machines and communicate via remote procedure calls (RPCs).
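To make the C/S model concrete, below is a minimal sketch (not the course's code) of the idea behind an RPC: the client sends a named request over a channel and blocks until the reply arrives, so a remote module can be invoked like a local function. The wire format and function names here are invented for illustration.

```python
import json
import socket

def rpc_call(host, port, proc, args):
    """Client stub: send a {proc, args} request, block for the reply."""
    with socket.create_connection((host, port)) as s:
        s.sendall(json.dumps({"proc": proc, "args": args}).encode())
        s.shutdown(socket.SHUT_WR)            # mark end of the request
        reply = s.makefile().read()           # RPCs are synchronous: block here
    return json.loads(reply)["result"]

def rpc_serve(port, procs):
    """Server loop: dispatch each request to a local function in `procs`."""
    with socket.create_server(("", port)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                req = json.loads(conn.makefile().read())
                result = procs[req["proc"]](*req["args"])
                conn.sendall(json.dumps({"result": result}).encode())

# e.g. rpc_serve(8000, {"add": lambda a, b: a + b}) on the server; then
# rpc_call("localhost", 8000, "add", [1, 2]) on the client returns 3.
```

A real RPC system adds marshaling of richer types, timeouts, and failure handling on top of this shape.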
Naming Schemes
In this section, we talk about naming, which allows modules to communicate.
Naming means that a name can be resolved to the entity it refers to. It allows modules to interact, and it helps achieve goals such as indirection and user-friendliness.
The design of a naming scheme has 3 parts: name, value, and look-up algorithm.
One great example of a naming scheme is the Domain Name System (DNS), which illustrates principles such as hierarchy, scalability, delegation, and decentralization. In particular, the hierarchical design of DNS lets it scale to the size of the Internet.
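As a toy illustration of those 3 parts, here is a hedged sketch of DNS-style hierarchical resolution: each zone either holds the final value or delegates to the zone responsible for the next label. The records below are invented.

```python
# Each zone maps a name suffix either to a final value (an address string)
# or to a sub-zone (a dict) to which the query is delegated.
ROOT = {
    "edu": {
        "mit.edu": {
            "web.mit.edu": "104.90.20.152",   # made-up record
        },
    },
}

def resolve(name):
    """Look-up algorithm: walk suffixes right to left, following delegations."""
    labels = name.split(".")
    zone = ROOT
    for i in range(len(labels) - 1, -1, -1):
        entry = zone[".".join(labels[i:])]    # "edu", then "mit.edu", ...
        if isinstance(entry, str):            # reached the value
            return entry
        zone = entry                          # delegation to a lower zone
    raise KeyError(name)

print(resolve("web.mit.edu"))                 # 104.90.20.152
```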
Virtual Memory
Virtual memory is the primary technique for virtualizing memory: the Memory Management Unit (MMU) translates virtual addresses into physical addresses using page tables.
To enforce modularity, the operating system (OS) kernel checks the following 3 bits:
Name | Description |
---|---|
User/Supervisor (U/S) bit | whether a user-mode program is allowed to access the page |
Present (P) bit | whether the page is currently in memory |
User/Kernel (U/K) bit | whether the processor is running in user mode or kernel mode |
These 3 bits let the OS know when to trigger page faults. If an access triggers an exception, the processor first switches to kernel mode, the kernel executes the corresponding exception handler, and then it switches back to user mode.
To deal with performance issues, operating systems introduce two mechanisms: the hierarchical page table and a cache. The hierarchical (multilevel) page table reduces the memory overhead of the page table, at the expense of more table look-ups. The cache, the Translation Lookaside Buffer (TLB), stores recent virtual-to-physical translations to enable faster retrieval.
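To make the mechanism concrete, here is a small sketch of a two-level walk. The field widths are an assumption for illustration: a 32-bit address split into a 10-bit level-1 index, a 10-bit level-2 index, and a 12-bit page offset.

```python
OFFSET_BITS = 12   # assumed 4 KB pages
INDEX_BITS = 10    # assumed 10-bit index at each level

def translate(vaddr, l1_table):
    """Walk a two-level page table; a missing entry models a page fault."""
    l1_idx = vaddr >> (OFFSET_BITS + INDEX_BITS)
    l2_idx = (vaddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)

    l2_table = l1_table.get(l1_idx)
    if l2_table is None:
        raise LookupError("page fault: level-1 entry not present")
    frame = l2_table.get(l2_idx)
    if frame is None:
        raise LookupError("page fault: level-2 entry not present")
    return (frame << OFFSET_BITS) | offset

# Example: map virtual page (l1=1, l2=2) to physical frame 7.
page_table = {1: {2: 7}}
vaddr = (1 << 22) | (2 << 12) | 0x123
assert translate(vaddr, page_table) == (7 << 12) | 0x123
```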
The OS enforces modularity through virtualization and abstraction. For resources that can be virtualized, such as memory, the OS uses virtualization; for components that are difficult to virtualize, such as the disk and the network, the OS presents abstractions (e.g., file systems and sockets).
Bounded Buffer with Lock
Let's virtualize communication links: the bounded buffer.
But first, we need the lock: a mechanism that allows only one CPU at a time to execute a piece of code, which lets us implement atomic actions. If two CPUs try to acquire the same lock at the same time, only one of them succeeds; the other blocks until the first CPU releases the lock.
Implementing locks requires hardware support: an atomic exchange instruction (XCHG below), together with a memory controller that serializes accesses to each memory location.
```
acquire(lock):
    do:
        r = 1
        XCHG r, lock     # atomically swap r and the lock's value
    while r == 1         # r == 1 means someone else held the lock

release(lock):
    lock = 0
```
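As a runnable analogue of this pseudocode, here is a sketch in Python. The hardware's atomic exchange is only emulated with a private threading.Lock, since real atomicity comes from the XCHG instruction and the memory controller.

```python
import threading

class SpinLock:
    """Educational spinlock; the atomic exchange is emulated in software."""
    def __init__(self):
        self._word = 0
        self._atomic = threading.Lock()   # stand-in for hardware atomicity

    def _xchg(self, new):
        with self._atomic:
            old, self._word = self._word, new
        return old

    def acquire(self):
        while self._xchg(1) == 1:         # spin until we observed 0
            pass

    def release(self):
        self._word = 0                    # a plain store suffices to release

# Usage: two threads increment a shared counter safely.
lock, counter = SpinLock(), [0]
def work():
    for _ in range(10_000):
        lock.acquire()
        counter[0] += 1                   # critical section
        lock.release()

ts = [threading.Thread(target=work) for _ in range(2)]
for t in ts: t.start()
for t in ts: t.join()
assert counter[0] == 20_000
```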
A bounded buffer is a buffer that has (up to) N slots and allows concurrent programs to send/receive messages.
A bounded buffer accessed concurrently is prone to race conditions, so we protect it with locks; the question is where to put them:
- coarse-grained locking makes correctness easy to maintain, but it leads to poor performance;
- fine-grained locking improves performance, but it risks leaving the buffer in an inconsistent state;
- when multiple locks are needed, they must be acquired in the same order everywhere, otherwise deadlock may happen (see the sketch after this list).
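Here is a minimal sketch of the lock-ordering rule with invented account objects: every code path acquires the two locks in the same global order (here, sorted by name), so two transfers in opposite directions cannot deadlock.

```python
import threading

class Account:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock in a globally consistent order (here: by name), so
    # transfer(a, b) and transfer(b, a) can never deadlock each other.
    first, second = sorted([src, dst], key=lambda acct: acct.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account("a", 100), Account("b", 100)
transfer(a, b, 30)   # safe even if another thread runs transfer(b, a, ...)
```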
In addition, the bounded buffer with a lock is yet another example of virtualization: each sender/receiver behaves as if it had the whole buffer to itself.
Concurrent Threads
Let's virtualize processors: threads.
Thread
A thread is a virtual processor, and it has 3 states:
- RUNNING (actively running)
- RUNNABLE (ready to go, but not running)
- WAITING (waiting for a particular event)
To change the state of a thread, we often use 2 APIs:

- `suspend()`: save the state of the current thread to memory.
- `resume()`: restore state from memory.

In reality, most threads spend most of their time waiting for events to occur. So we use `yield()` to let the current thread voluntarily suspend itself, and then let the kernel choose a new thread to resume.
In particular, we maintain a processor table and a thread table.

- The processor table (`cpus`) keeps track of which processor is currently running which thread;
- The thread table (`threads`) keeps track of thread states.
```
yield_():
    acquire(t_lock)
    # 1. Suspend the running thread
    id = cpus[CPU].thread            # thread `id` is running on this CPU
    threads[id].state = RUNNABLE
    threads[id].sp = SP              # stack pointer
    threads[id].ptr = PTR            # page table register

    # 2. Choose a new thread to run (round robin)
    do:
        id = (id + 1) mod N
    while threads[id].state != RUNNABLE

    # 3. Resume the new thread
    SP = threads[id].sp
    PTR = threads[id].ptr
    threads[id].state = RUNNING
    cpus[CPU].thread = id

    release(t_lock)

# send a `message` into `bb` (N-slot buffer)
send(bb, message):
    acquire(bb.lock)
    # while the buffer is full
    while bb.in_num - bb.out_num >= N:
        release(bb.lock)
        yield_()
        acquire(bb.lock)
    bb.buf[bb.in_num % N] <- message
    bb.in_num += 1
    release(bb.lock)

# receive a message from `bb`
receive(bb):
    acquire(bb.lock)
    # while the buffer is empty
    while bb.out_num >= bb.in_num:
        release(bb.lock)
        yield_()
        acquire(bb.lock)
    message <- bb.buf[bb.out_num % N]
    bb.out_num += 1
    release(bb.lock)
    return message
```
However, with this implementation the sender may be scheduled again and again even while there is still no room in the buffer, wasting CPU time on polling. One way to fix that is to use condition variables.
Condition Variable
A condition variable is a synchronization primitive that allows the kernel to notify threads instead of having threads constantly poll. It has 2 APIs:
- `wait(cv)`: yield the processor and wait to be notified of `cv`, a condition variable.
- `notify(cv)`: notify the threads that are waiting for `cv`.
However, condition variables without a lock may cause the "lost notify" problem:
```
# send a `message` into `bb` (N-slot buffer)
send(bb, message):
    acquire(bb.lock)
    # while the buffer is full
    while bb.in_num - bb.out_num >= N:
        release(bb.lock)
        wait(bb.has_space)           ### !
        acquire(bb.lock)
    bb.buf[bb.in_num % N] <- message
    bb.in_num += 1
    release(bb.lock)
    notify(bb.has_message)
    return

# receive a message from `bb`
receive(bb):
    acquire(bb.lock)
    # while the buffer is empty
    while bb.out_num >= bb.in_num:
        release(bb.lock)
        wait(bb.has_message)
        acquire(bb.lock)
    message <- bb.buf[bb.out_num % N]
    bb.out_num += 1
    release(bb.lock)
    notify(bb.has_space)             ### !
    return message
```
Consider two threads: T1 (the sender) and T2 (the receiver).

- T1 acquires `bb.lock`, finds the buffer full, so T1 releases `bb.lock`.
- Before T1 calls `wait(bb.has_space)`, T2 acquires `bb.lock`, reads messages, and notifies that the buffer now has space.
- But T1 is not actually waiting for `bb.has_space` yet (the OS interrupted T1 before it could call `wait(bb.has_space)`), so the notification is lost.
This is the "lost notify" problem. The fix is to pass the lock to `wait()`, so that releasing the lock and starting to wait happen atomically:
- `wait(cv, lock)`: yield the processor, release `lock`, and wait to be notified of `cv`.
- `notify(cv)`: notify the threads waiting for `cv`.
```
yield_wait():
    id = cpus[CPU].thread
    threads[id].sp = SP
    threads[id].ptr = PTR
    SP = cpus[CPU].stack             # switch to the CPU's own stack
                                     # to avoid stack corruption
    do:
        id = (id + 1) mod N
        release(t_lock)              # !
        acquire(t_lock)              # !
    while threads[id].state != RUNNABLE

    SP = threads[id].sp
    PTR = threads[id].ptr
    threads[id].state = RUNNING
    cpus[CPU].thread = id


wait(cv, lock):
    acquire(t_lock)
    release(lock)                    # let others access what `lock` protects
    # mark the current thread: waiting for `cv`
    id = cpus[CPU].thread
    threads[id].cv = cv
    threads[id].state = WAITING

    # different from `yield_()` mentioned above!
    yield_wait()

    release(t_lock)
    acquire(lock)                    # reclaim access to what `lock` protects


notify(cv):
    acquire(t_lock)
    # Find all threads waiting for `cv`,
    # and change their state: WAITING -> RUNNABLE
    for id = 0 to N-1:
        if threads[id].cv == cv &&
           threads[id].state == WAITING:
            threads[id].state = RUNNABLE
    release(t_lock)

# send `message` into N-slot buffer `bb`
send(bb, message):
    acquire(bb.lock)
    while bb.in_num - bb.out_num >= N:
        wait(bb.has_space, bb.lock)
    bb.buf[bb.in_num % N] <- message
    bb.in_num += 1
    release(bb.lock)
    notify(bb.has_message)
    return

# receive a message from `bb`
receive(bb):
    acquire(bb.lock)
    # while the buffer is empty
    while bb.out_num >= bb.in_num:
        wait(bb.has_message, bb.lock)
    message <- bb.buf[bb.out_num % N]
    bb.out_num += 1
    release(bb.lock)
    notify(bb.has_space)
    return message
```
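For comparison, here is a sketch of the same bounded buffer using Python's threading.Condition, which bundles a lock with a condition variable; its wait() releases the lock and reacquires it on wake-up, exactly the semantics built above. One difference: Python requires notify() to be called while holding the lock, so it sits inside the with block.

```python
import threading

class BoundedBuffer:
    def __init__(self, n):
        self.buf, self.n = [None] * n, n
        self.in_num = self.out_num = 0
        self.lock = threading.Lock()
        # Two condition variables sharing the same lock, as above.
        self.has_space = threading.Condition(self.lock)
        self.has_message = threading.Condition(self.lock)

    def send(self, message):
        with self.lock:
            while self.in_num - self.out_num >= self.n:
                self.has_space.wait()        # releases lock while waiting
            self.buf[self.in_num % self.n] = message
            self.in_num += 1
            self.has_message.notify()

    def receive(self):
        with self.lock:
            while self.out_num >= self.in_num:
                self.has_message.wait()
            message = self.buf[self.out_num % self.n]
            self.out_num += 1
            self.has_space.notify()
            return message
```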
Note:

- Why `yield_wait()`, rather than `yield_()`? Because `yield_()` would cause deadlock: at the beginning of `wait(cv, lock)` we acquire and hold `t_lock`, so invoking `yield_()` would try to acquire `t_lock` again.
- Why does `yield_wait()` release and then immediately reacquire `t_lock` inside its loop? To guarantee progress. Suppose 5 senders are waiting on a full buffer and only 1 receiver reads from it: while a CPU spins in `yield_wait()` looking for a RUNNABLE thread, it holds `t_lock`, so it must briefly release `t_lock` to let the receiver's `notify()` acquire it and mark a waiting sender RUNNABLE.
- Why do we need `SP = cpus[CPU].stack`? To avoid stack corruption: after the thread's state is saved, another CPU may resume this thread and start using its stack, so `yield_wait()` must spin on the CPU's own stack instead.
A new problem arises: what if a thread never yields the CPU? Use preemption.
Preemption
Preemption forcibly interrupts a thread so that we don't have to rely on programmers correctly using `yield()`. If a thread never calls `yield()` or `wait()`, that's okay: special hardware (a timer) periodically generates an interrupt, and the interrupt handler forcibly calls `yield_()`.
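As a loose, Unix-only analogy in Python (not how a kernel does it, but the same shape): a periodic timer signal interrupts a loop that never yields voluntarily.

```python
import signal, time

fires = []

def timer_interrupt(signum, frame):
    # In a real kernel, the handler would save the thread's registers
    # and call yield_() to schedule another RUNNABLE thread.
    fires.append(time.monotonic())

signal.signal(signal.SIGALRM, timer_interrupt)
signal.setitimer(signal.ITIMER_REAL, 0.05, 0.05)  # fire every 50 ms

t0 = time.monotonic()
while time.monotonic() - t0 < 0.3:   # a loop that never calls yield()...
    pass                             # ...yet it still gets interrupted

signal.setitimer(signal.ITIMER_REAL, 0)           # cancel the timer
print(f"interrupted {len(fires)} times")          # roughly 6 times
```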
But what if this interrupt occurs while `yield_()` or `yield_wait()` is running? Deadlock: the handler's `yield_()` would try to acquire `t_lock`, which the interrupted code already holds. The solution is a hardware mechanism that disables interrupts while this code runs.
Kernel
The kernel is a non-interruptible, trusted program that runs system code.
Kernel errors are fatal, so we try to limit the size of the kernel code. There are two models for kernels:
- The monolithic kernel implements most of the OS in the kernel, with all subsystems sharing one address space.
- The microkernel implements OS features as client/server user programs, enforcing modularity by putting subsystems in separate user-space processes.
Virtual Machine
A virtual machine (VM) allows us to run multiple isolated operating systems on a single physical machine. VMs must handle the challenge of virtualizing the hardware.
The Virtual Machine Monitor (VMM) deals with privileged instructions, allocates resources, and dispatches events.
The guest OS runs in user mode; privileged instructions throw exceptions, which the VMM traps and emulates. On modern hardware, the MMU knows about both page tables (the guest's and the VMM's) and translates directly from guest virtual addresses to host physical addresses.
However, there are still some instructions that cannot be trapped. There are several solutions:
- Para-virtualization: the guest OS is modified slightly, which partly defeats the purpose of running an unmodified OS in a VM.
- Binary translation is another method (VMware used to use this), but it is somewhat messy.
- Hardware support for virtualization means that hardware has VMM capabilities built-in. The guest OS can directly manipulate page tables, etc. Most VMMs today have hardware support.
Performance
There are 3 metrics to measure performance:
- Latency: how long it takes to complete a single task.
- Throughput: the rate of useful work, i.e., how many requests are completed per unit of time.
- Utilization: what fraction of resources is being used.
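A quick worked example with invented numbers ties the three together:

```python
# Suppose a server spends 10 ms of work on each request (its latency),
# handles requests serially, and receives 60 requests per second.
latency = 0.010                       # seconds per request
offered_load = 60                     # requests per second

throughput = offered_load             # it keeps up, since 60 < 1/latency = 100
utilization = offered_load * latency  # fraction of time the server is busy
print(throughput, utilization)        # 60 requests/s, 0.6 (60% busy)
```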