MIT 6.033 CSE Operating System

Course Note for Operating System of MIT 6.033 Computer System Engineering covers common *design patterns* to limit *complexity* and techniques, like *virtualization* and *abstraction*, to enforce *modularity*.

April 6, 2023 System Design 7 min read

MIT 6.033 (Computer System Engineering) covers 4 parts: Operating Systems, Networking, Distributed Systems, and Security.

This is the course note for Part I: Operating Systems. And in this section, we mainly focus on:

How common design patterns in computer system — such as abstraction and modularity — are used to limit complexity.
How operating systems use virtualization and abstraction to enforce modularity.

In this section, we talk about what is complexity in computer systems, and how to mitigate it.

A system is a set of interconnected components that has an expected behavior observed at the interface with its environment.

So we say that a system has complexity, which limits what we can build. However, complexity can be mitigated with design patterns, such as modularity and abstration.

Nowadays, we usually enforce modularity by Client/Server Model, or C/S Model, where two modules reside on different machines and communicate with RPCs.

In this section, we talk about naming, which allows modules to communicate.

Naming is that a name can be resolved to the entity it refers to. Therefore, it allows modules to interact, and can help to achieve goals such as indirection, user-friendliness, etc.

The design of a naming scheme has 3 parts: name, value, and look-up algorithm.

One great case of naming scheme is Domain Name System (DNS), which illustrates principles such as hierarchy, scalability, delegation and decentralization. Especially, the hierarchical design of DNS let us scale up to the Internet.

Virtual Memory is a primary technique that uses Memory Management Unit (MMU) to translate virtual address into physical address by using page tables.

To enforce modularity, the operating system(OS) kernel checks the following 3 bits:

Name	Description
User/Supervisor (U/S) bit	if the program allowed to access the address
Present (P) bit	if the page currently in memory
User/Kernel (U/K) bit	whether the operation is in user mode or kernel mode

These 3 bits let the OS know when to trigger page faults, and if the access triggers an exception, the OS kernel will first switch to kernel mode and then execute the corresponding exception handler before switching back to user mode.

To deal with performance issues, the Operating Systems introduce two mechanisms: hierarchical page table and cache. The hierarchical(multilevel) page table reduces the memory overhead associated with the page table, at the expense of more table look-ups. And cache, also known as Translation Lookaside Buffer (TLB), stores recent translations of virtual memory to physical addresses to enable faster retrieval.

OS enforces modularity by virtualization and abstraction. On resources that can be virtualized, such as memory, OS uses virtualization. And for those components that are difficult to virtualize such as disk and network, OS presents abstration.

Let’s virtualize communication links - the bouded buffers.

But first, we need Lock, which is a protecting mechanism that allows only one CPU to execute a piece of code at a time to implement atomic actions. If two CPUs try to acquire the same lock at the same time, only one of them will succeed and the other will block until the first CPU releases the lock.

Implementing locks is possible by the support of a special hardware called controller that manages access to memory.

acquire(lock):
    do:
        r = 1
        XCHG r, lock
    while r == 1

release(lock):
    lock = 0

A bounded buffer is a buffer that has (up to) N slots and allows concurrent programs to send/receive messages.

A bounded buffer with lock may deal with race condition, therefore, we need to decide where to put locks:

coarse-grained locking is easy to maintain correctness, but it will lead to bad performance;
fine-grained locking improves performance, but it may cause inconsistent state;
multiple locking requires that locks are acquired in the same order, otherwise the dead lock may happen.

In addition, bounded buffer with lock is yet another example of virtualization, which means any of senders/receivers think it has full access to the whole buffer.

Let’s virtualize processors - the threads.

Thread is a virtual processor and has 3 states:

RUNNING (actively running)
RUNNABLE (ready to go, but not running)
WAITING (waiting for a particular event)

To change the states of a thread, we often use 2 APIs:

suspend(): save state of current thread to memory.
resume(): restore state from memory.

In reality, most threads spend most of the time waiting for events to occur. So we use yield() to let the current thread voluntarily suspend itself, and then let the kernel choose a new thread to resume execution.

In particular, we maintain a processor table and a thread table.

The processor table (cpus) keeps track of which processor is currently running which thread;
The thread table (threads) keeps track of thread states.

yield_():
    acquire(t_lock)
    # 1. Suspend the running thread
    id = cpus[CPU].thread # thread #id is on #CPU
    threads[id].state = RUNNABLE
    threads[id].sp = SP   # stack pointer
    threads[id].ptr = PTR # page table register

    # 2. Choose the new thread to run
    do:
        id = (id + 1) mod N
    while threads[id].state != RUNNABLE

    # 3. Resume the new thread
    SP = threads[id].sp
    PTR = threads[id].ptr
    threads[id].state = RUNNING
    cpus[CPU].thread = id

    release(t_lock)

# send a `message` into `bb`(N-slot buffer)
send(bb, message ):
    acquire(bb.lock)
    # when the buffer is full
    while bb.in_num - bb.out_num >= N:
        release(bb.lock)
        yield_()
        acquire(bb.lock)
    bb.buf[bb.in_num % N] <- message
    bb.in_num += 1
    release(bb.lock)

# reveive a message from bb
receive(bb):
    acquire(bb.lock)
    # while the buffer is empty
    while bb.out_num >= bb.in_num:
        release(bb.lock)
        yield_()
        acquire(bb.lock)
    message <- bb.buf[bb.out_num % N]
    bb.out_num += 1
    release(bb.lock)
    return message

However, the sender may get resumed in the meantime, even if there is no room in buffer. One solution to fix that is to use condition variables

Condition variable is simply a synchronization primitive that allow kernel to notify threads instead of having threads constantly make checks. And it has 2 APIs:

wait(cv): yield processor and wait to be notified of cv, a condition variable.
notify(cv): notify threads that are waiting for cv.

However, condition variables without lock may cause “Lost notify” problem:

# send a `message` into `bb`(N-slot buffer)
send(bb, message ):
    acquire(bb.lock)
    # while the buffer is full
    while bb.in_num - bb.out_num >= N:
        release(bb.lock)
        wait(bb.has_space)  ### !
        acquire(bb.lock)
    bb.buf[bb.in_num % N] <- message
    bb.in_num += 1
    release(bb.lock)
    notify(bb.has_message)
    return

# reveive a message from bb
receive(bb):
    acquire(bb.lock)
    # while the buffer is empty
    while bb.out_num >= bb.in_num:
        release(bb.lock)
        wait(bb.has_message)
        acquire(bb.lock)
    message <- bb.buf[bb.out_num % N]
    bb.out_num += 1
    release(bb.lock)
    notify(bb.has_space) ### !
    return message

Considering there are two threads: T1(sender), and T2(receiver).

T1 acquires bb.lock on buffer, finding it full, so T1 releases bb.lock
Prior to T1 calling wait(bb.has_space), T2 just acquires bb.lock to read messages, notifying the T1 that the buffer now has space(s).
but T1 is actually not yet waiting for bb.has_space (Bacause T1 was interrupted by OS before it could call wait(bb.has_space)).

So, as you can see, it cause the “lost notify” problem. And the solution to fix that is use a lock.

wait(cv, lock): yield processor, release lock, wait to be notified of cv
notify(cv): notify waiting threads of cv

yield_wait():
    id = cpus[CPU].thread
    threads[id].sp = SP
    threads[id].ptr = PTR
    SP = cpus[CPU].stack # avoid stack corruption

    do:
        id = (id + 1) mod N
        release(t_lock) # !
        acquire(t_lock) # !
    while threads[id].state != RUNNABLE

    SP = threads[id].sp
    PTR = threads[id].ptr
    threads[id].state = RUNNING
    cpus[CPU].thread = id


wait(cv, lock):
    acquire(t_lock)
    release(lock) # let others access what `lock` protects
    # mark the current thread: wait for `cv`
    id = cpus[CPU].thread
    threads[id].cv = cv
    threads[id].state = WAITING

    # different from `yield_()` mentioned above!
    yield_wait()

    release(t_lock)
    acquire(lock) # disallow others to access what `lock` protects


notify(cv):
    acquire(t_lock)
    # Find all threads waiting for `cv`,
    #  and change states: WAITING -> RUNNABLE
    for id = 0 to N-1:
        if threads[id].cv == cv &&
         threads[id].state == WAITING:
            threads[id].state = RUNNABLE
    release(t_lock)

# send `message` into N-slot buffer `bb`
send(bb, message):
    acquire(bb.lock)
    while bb.in_num - bb.out_num >= N:
        wait(bb.has_space, bb.lock)
    bb.buf[bb.in_num % N] <- message
    bb.in_num += 1
    release(bb.lock)
    notify(bb.has_message)
    return

# reveive a message from bb
receive(bb):
    acquire(bb.lock)
    # while the buffer is empty
    while bb.out_num >= bb.in_num:
        wait(bb.has_message, bb.lock)
    message <- bb.buf[bb.out_num % N]
    bb.out_num += 1
    release(bb.lock)
    notify(bb.has_space)
    return message

Note:

Why yield_wait(), rather than yield_()? Because yield_() will cause Deadlock. At the beginning of wait(cv, lock), we acquire and hold t_lock. So if we invoke yield_(), it will try to acquire t_lock again, causing deadlock problem.
Why yield_wait() releases and then immediately acquires t_lock? Because it guarantee other threads can access the buffer. Considering there are 5 senders writing into buffer and only 1 receiver reading the buffer. If all 5 senders find the buffer full, it is important to release t_lock to let the only 1 receiver acquire the t_lock and read the buffer.
Why do we need to SP = cpus[CPU].stack? To avoid stack corruption when this thread is scheduled to a different CPU.

And the new problem arises, what if the thread never yield CPU? Use preemption.

Preemption forcibly interrupts a thread so that we don’t have to rely on programmers correctly using yield(). In this case, if a thread never calls yield() or wait(), it’s okay; special hardware will periodically generate an interrupt and forcibly call yield().

But what if this interrupt occurs while running yield() or yield_wait(): Deadlock. And the solution is to require hardware mechanism to disable interrupts.

The kernel is a non-interruptible, trusted program that runs system code.

Kernel errors are fatal, so we try to limit the size of kernel code. There are two models for kernels.

The monolithic kernel implements most of the OS in the kernel, and everything sharing
The microkernel implements different features as client-servers. They enforce modularity by putting subsystems in user programs.

Virtual Machine (VM) allows us to run multiple isolated operating systems on a single physical machine. VMs must handle the challenges of virtualizing the hardware.

Figure 1. Virtual Machine Environment

The Virtual Machine Monitor (VMM) deals with privileged instructions, allocates resources, and dispatches events.

The guest OS runs in user mode. Privileged instructions throw exceptions, and VMM will trap and emulate. In modern hardware, the physical hardware knows of both page tables, and it directly translates from guest virtual address to host physical address.

However, there are still some cases in which we cannot trap exceptions. There are several solutions:

Para-virtualization is where the guest OS changes a bit, which defeats the purpose of a VM
Binary translation is also a method (VMWare used to use this), but it is slightly messy
Hardware support for virtualization means that hardware has VMM capabilities built-in. The guest OS can directly manipulate page tables, etc. Most VMMs today have hardware support.

There are 3 metrics to measure performance:

latency: how long does it take to complete a single task?

Figure 2. Latency vs. number of users

Throughput: the rate of useful work, or how many requests per unit of time.

Figure 3. Throughput vs. number of users

Utilization: what fraction of resources are being utilized

Comments