MIT 6.033 CSE Distributed Systems

MIT 6.033 (Computer System Engineering) covers 4 parts: Operating Systems, Networking, Distributed Systems, and Security.

These are the course notes for Part III: Distributed Systems. In this part, we focus on how reliable, usable distributed systems can be built on top of an unreliable network.

Reliability via Replication

In this section, we talk about how to achieve reliability via replication, in particular RAID (Redundant Array of Independent Disks), which tolerates disk faults. Here we assume that the entire machine could fail.

Generally, there are 3 steps to build reliable systems:

  1. identify all possible faults
  2. detect and contain the faults
  3. handle faults ("recover")

To quantify reliability, we use availability: $$ Availability = \frac{MTTF}{MTTF+MTTR} \tag{1.1}$$ where MTTF (Mean Time To Failure) is the average time the system runs before it fails, and MTTR (Mean Time To Recovery) is the average time it takes to repair the system after a failure.
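
For a quick, made-up example: a disk with an MTTF of 10,000 hours and an MTTR of 10 hours has $$ Availability = \frac{10000}{10000+10} \approx 99.9\% $$ so shrinking MTTR improves availability just as effectively as extending MTTF.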

RAID replicates data across disks so that it can tolerate disk failures.

  • RAID-1: mirrors each disk, so it tolerates disk failures but requires $2n$ disks to store $n$ disks' worth of data.
  • RAID-4: has a dedicated parity disk and requires only $n+1$ disks, but every write must also update the parity disk (a bottleneck).
  • RAID-5: spreads the parity blocks across all disks (data is striped across multiple disks), which spreads out the write requests (better performance); it still requires $n+1$ disks. A parity-reconstruction sketch follows this list.
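
To see why a single parity block per stripe is enough to tolerate one disk failure, here is a minimal Python sketch (the disk contents are made up for illustration): XOR-ing the surviving blocks with the parity block reconstructs the lost one.

```python
from functools import reduce

def xor_blocks(blocks):
    # XOR a list of equal-length byte blocks together.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Hypothetical contents of three data disks (one block per disk).
data_disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]

# RAID-4/5 parity block: the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Suppose disk 1 fails: XOR the survivors with the parity to rebuild it.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])
assert rebuilt == data_disks[1]
```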

Single-Machine Transactions

In this section, we talk about an abstraction that makes fault tolerance achievable: transactions. Here we assume the machine itself keeps running, but individual operations may fail.

Transactions provide atomicity and isolation, which make reasoning about failures (and concurrency) easier.

Atomicity

Atomicity means that an action either happens completely or does not happen at all.

For one user and one file, we can implement atomicity with shadow copies (write to a temporary file, then rename it over bank_file, for example), but shadow copies perform poorly.
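
A minimal sketch of the shadow-copy idea (the file name and contents are invented for illustration): the write either fully replaces bank_file or leaves it untouched, because the final rename is atomic on a POSIX file system.

```python
import os

def atomic_write(path, data):
    # Shadow copy: write to a temporary file, flush it to disk,
    # then atomically rename it over the real file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # make the shadow copy durable before the rename
    os.rename(tmp, path)       # atomic: readers see either the old file or the new one

atomic_write("bank_file", "alice=90\nbob=110\n")
```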

We keep a log on disk (in addition to cell storage) to record operations, so that operations from transactions that had not committed before a crash can be undone. There are two kinds of records: UPDATE and COMMIT:

  • UPDATE records contain the old and new values of the modified variable.
  • COMMIT records indicate that a transaction has committed.

To speed up the recovery process, we write checkpoints and truncate the log.
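
A toy recovery scan under these assumptions (the record format below is invented for illustration): any UPDATE whose transaction never logged a COMMIT is undone using its old value.

```python
def recover(log, cell_storage):
    # log: list of ("UPDATE", txn, var, old, new) or ("COMMIT", txn) records.
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    # Scan backwards so that, per variable, the earliest old value wins.
    for rec in reversed(log):
        if rec[0] == "UPDATE":
            _, txn, var, old, new = rec
            if txn not in committed:
                cell_storage[var] = old   # undo the uncommitted write
    return cell_storage

# Example: T1 committed, T2 crashed before committing.
log = [("UPDATE", "T1", "x", 0, 5), ("COMMIT", "T1"),
       ("UPDATE", "T2", "x", 5, 9)]
print(recover(log, {"x": 9}))   # {'x': 5}
```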

Isolation via 2PL

In this section, we use Two-Phase Locking (2PL) to run transactions ($T_1, T_2, ..., T_n$) concurrently while still producing a schedule that is conflict serializable.

Isolation refers to how and when the effects of one action (A1) become visible to another (A2). The goal is that A1 and A2 appear to have executed serially, even though they actually ran in parallel.

Two operations conflict if they operate on the same object and at least one of them is a write. A schedule is conflict serializable if the order of all its conflicts is the same as the order of those conflicts in some serial schedule.

We use a conflict graph to express the order of conflicts succinctly: a schedule is conflict-serializable $\Leftrightarrow$ its conflict graph is acyclic. E.g., consider the following schedule:

T1: read(x)
T2: write(x)
T1: write(x)
T3: write(x)

Explanation: $T_1$ reads x first, and $T_2$ and $T_3$ later write x, giving edges $T_1 \to T_2$ and $T_1 \to T_3$. Then $T_2$ writes x, and $T_1$ and $T_3$ later write x, giving $T_2 \to T_1$ and $T_2 \to T_3$. Finally $T_1$ writes x before $T_3$ writes x, which only repeats the edge $T_1 \to T_3$.

---
title: Figure 1. Conflict Graph
---
graph LR
	T1 --> T2
	T1 --> T3
	T2 --> T1
	T2 --> T3

The conflict graph has a cycle ($T_1 \to T_2 \to T_1$), so this schedule is not conflict-serializable.
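
Here is a small sketch (the schedule representation is invented for these notes) that builds the conflict graph of a schedule and checks whether it is acyclic, i.e. conflict-serializable:

```python
def conflict_graph(schedule):
    # schedule: list of (txn, op, var) tuples, in execution order.
    # Add an edge t1 -> t2 for each conflicting pair where t1 runs first.
    edges = set()
    for i, (t1, op1, v1) in enumerate(schedule):
        for t2, op2, v2 in schedule[i + 1:]:
            if t1 != t2 and v1 == v2 and "write" in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_conflict_serializable(schedule):
    edges = conflict_graph(schedule)
    nodes = {t for t, _, _ in schedule}

    def has_cycle(node, visiting, done):
        # Depth-first search looking for a back edge.
        visiting.add(node)
        for a, b in edges:
            if a == node:
                if b in visiting or (b not in done and has_cycle(b, visiting, done)):
                    return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(n, set(), set()) for n in nodes)

schedule = [("T1", "read", "x"), ("T2", "write", "x"),
            ("T1", "write", "x"), ("T3", "write", "x")]
print(is_conflict_serializable(schedule))   # False: T1 -> T2 and T2 -> T1 form a cycle
```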

Two-Phase Locking (2PL) is a concurrency control protocol used in database management systems (DBMS) to ensure the serializability of transactions. It consists of two distinct phases: the growing phase, during which a transaction acquires locks, and the shrinking phase, during which it releases locks and may not acquire new ones.

A valid Two-Phase Locking schedule follows these rules (a minimal sketch follows the list):

  1. each shared variable has a lock
  2. before any operation on a variable, the transaction must acquire the corresponding lock
  3. after a transaction releases a lock, it may not acquire any other lock
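
A minimal sketch of the locking discipline (the lock table and transfer transaction are invented for illustration): every acquire happens before any release, so the transaction is two-phase.

```python
import threading

# One lock per shared variable.
lock_table = {"x": threading.Lock(), "y": threading.Lock()}
balances = {"x": 100, "y": 0}

def transfer(amount):
    # Growing phase: acquire every lock this transaction needs.
    lock_table["x"].acquire()
    lock_table["y"].acquire()
    try:
        balances["x"] -= amount
        balances["y"] += amount
    finally:
        # Shrinking phase: release locks; no further acquires are allowed.
        lock_table["y"].release()
        lock_table["x"].release()

transfer(10)
print(balances)   # {'x': 90, 'y': 10}
```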

However, 2PL can result in deadlock. The usual solution is to impose a global ordering on locks. A more elegant solution takes advantage of the atomicity of transactions and simply aborts one of the deadlocked transactions.

If we want better performance, we use 2PL with reader/writer locks (a sketch of such a lock follows this list):

  1. each variable has two locks: one for reading, one for writing
  2. before any operation on a variable, the transaction must acquire the appropriate lock.
  3. multiple transactions can hold reader locks for the same variable at once; a transaction can only hold a writer lock for a variable if no other locks are held for that variable.
  4. after a transaction releases a lock, it may not acquire any other lock.
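
A minimal readers-writer lock sketch (a common textbook construction, not code from the course): many readers may hold the lock at once, but a writer must wait until no one else holds it.

```python
import threading

class RWLock:
    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0      # number of transactions holding the read lock
        self.writer = False   # whether a transaction holds the write lock

    def acquire_read(self):
        with self.cond:
            while self.writer:                       # readers only wait for a writer
                self.cond.wait()
            self.readers += 1

    def release_read(self):
        with self.cond:
            self.readers -= 1
            self.cond.notify_all()

    def acquire_write(self):
        with self.cond:
            while self.writer or self.readers > 0:   # writers need exclusive access
                self.cond.wait()
            self.writer = True

    def release_write(self):
        with self.cond:
            self.writer = False
            self.cond.notify_all()
```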

Distributed Transactions

When we move to distributed systems, a transaction may span multiple machines, which changes how we achieve atomicity.

Multisite Atomicity via 2PC

In this section, we use Two-Phase Commit (2PC) to achieve multisite atomicity in the face of failures.

Two-Phase Commit (2PC) is a distributed transaction protocol that ensures a transaction commits atomically across multiple nodes. 2PC consists of 2 phases:

  • Prepare Phase: the coordinator sends a Prepare message to check whether every participant is ready to commit the transaction.
  • Commit Phase: the coordinator sends a Commit message to the participants, waits for their acknowledgments, and informs the client that the transaction committed. (A toy coordinator sketch follows Figure 2.)
sequenceDiagram
    title: Figure 2. Two-Phase Commit (no failure)
	participant CL as Client
	participant CO as Coordinator
	participant AM as A-M Server
	participant NZ as N-Z Server

	CL->>CO: Commit Request
	CO->>AM: Prepare
    AM-->>CO: 
	CO->>NZ: Prepare
	NZ-->>CO: 
	CO-->>CL: OK
	CO->>AM: Commit
    AM-->>CO: 
	CO->>NZ: Commit
	NZ-->>CO: 
    CO-->>CL: OK
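
Putting the two phases together, here is a toy coordinator sketch (the Worker class and its methods are invented for illustration): it commits only if every participant votes yes, and the outcome becomes final once it is logged.

```python
class Worker:
    def __init__(self, ok=True):
        self.ok = ok
    def prepare(self):
        return self.ok            # vote yes/no in the prepare phase
    def commit(self):
        print("worker committed")
    def abort(self):
        print("worker aborted")

def two_phase_commit(coordinator_log, workers):
    # Phase 1: ask every worker to prepare and collect the votes.
    votes = [w.prepare() for w in workers]

    if all(votes):
        coordinator_log.append("COMMIT")   # commit point: the outcome is now final
        for w in workers:
            w.commit()                     # retried until acknowledged in a real system
        return "committed"
    else:
        coordinator_log.append("ABORT")
        for w in workers:
            w.abort()
        return "aborted"

log = []
print(two_phase_commit(log, [Worker(), Worker()]))   # committed
```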

However, 3 types of failures may happen:

  1. Message Loss (at any stage) or Message Reordering: solved by a reliable transport protocol such as TCP (with sequence numbers and ACKs).

  2. Failures before commit point that can be aborted:

    • Worker Failure BEFORE Prepare Phase: the coordinator can safely abort the transaction without additional communication with the workers. (The coordinator uses HELLO messages to detect worker failures.)
sequenceDiagram
    title: Figure 3. Worker Failure BEFORE Prepare Phase
	participant CL as Client
	participant CO as Coordinator
	participant A-M Server
	participant N-Z Server
    CL->>CO: Commit Request
	CO-->>CL: Abort
  • Worker Failure or Coordinator Failure DURING Prepare Phase: the coordinator can safely abort the transaction and sends an explicit abort message to the live workers.
sequenceDiagram
    title: Figure 4. Worker Fails DURING Prepare Phase
	participant CL as Client
	participant CO as Coordinator
	participant AM as A-M Server
	participant NZ as N-Z Server

	CL->>CO: Commit Request
	CO->>AM: Prepare
	AM-->>CO: 
	CO->>NZ: Prepare
	Note over NZ: worker fails
	CO->>AM: Abort
    AM-->>CO: 
    CO-->>CL: Abort
sequenceDiagram
    title: Figure 5. Coordinator Fails DURING Prepare Phase
	participant CL as Client
	participant CO as Coordinator
	participant AM as A-M Server
	participant NZ as N-Z Server

	CL->>CO: Commit Request
	CO->>AM: Prepare
	AM-->>CO: 
    Note over CO: coordinator fails and recovers
	CO->>AM: Abort
    AM-->>CO: 
    CO->>NZ: Abort
    NZ-->>CO: 
    CO-->>CL: Abort
  3. Worker Failure or Coordinator Failure during Commit Phase (after the commit point): the coordinator cannot abort the transaction; the machines must commit the transaction during recovery.
sequenceDiagram
    title: Figure 6. Worker Fails during Commit Phase
	participant CL as Client
	participant CO as Coordinator
	participant AM as A-M Server
	participant NZ as N-Z Server

	CL->>CO: Commit Request
	CO->>AM: Prepare
    AM-->>CO: 
	CO->>NZ: Prepare
	NZ-->>CO: 
	CO-->>CL: OK
	CO->>AM: Commit
    AM-->>CO:  
	CO->>NZ: Commit
	Note over NZ: worker fails and recovers
    NZ-->>CO: should I commit?
    CO->>NZ: Commit
    NZ-->>CO: 
    CO-->>CL: OK

sequenceDiagram
    title: Figure 7. Coordinator Fails during Commit Phase
	participant CL as Client
	participant CO as Coordinator
	participant AM as A-M Server
	participant NZ as N-Z Server

	CL->>CO: Commit Request
	CO->>AM: Prepare
    AM-->>CO: 
	CO->>NZ: Prepare
	NZ-->>CO: 
	CO-->>CL: OK
	CO->>AM: Commit
    AM-->>CO:  
    Note over CO: coordinator fails and recovers
    CO->>AM: Commit
    AM-->>CO:  
    CO->>NZ: Commit
    NZ-->>CO: 
    CO-->>CL: OK

Replicated State Machines

In this section, we replicate state on multiple machines to increase availability.

Replicated State Machines (RSM) use a primary/backup mechanism for replication:

Figure 8. Replicated State Machine

  • Coordinators make requests to the View Server to find out which replica is the primary, and then contact the primary.
  • The View Server ensures that only one replica acts as primary, and can recruit new backups if servers fail. It keeps a table that maintains a sequence of views, and receives pings from the primary and the backups.
  • The primary pings the View Server, receives requests from coordinators, and sends updates to the backups. The primary must get an ACK from its backups before completing an update. (A small sketch of this update path follows the list.)
  • Backups ping the View Server and receive update requests from the primary. (Note: backups reject any requests that come directly from a coordinator.)
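
A minimal sketch of the primary's update path under these rules (the Primary/Backup classes and their methods are invented for illustration): the primary applies an update only after its backup has acknowledged it, so an acknowledged write survives a fail-over to the backup.

```python
class Backup:
    def __init__(self):
        self.state = {}
    def apply(self, op):
        key, value = op
        self.state[key] = value
        return "ACK"

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup
    def handle_update(self, op):
        # Forward the update to the backup first and wait for its ACK...
        if self.backup.apply(op) != "ACK":
            raise RuntimeError("no ACK from backup; cannot complete the update")
        # ...only then apply it locally and reply to the coordinator.
        key, value = op
        self.state[key] = value
        return "OK"

backup = Backup()
primary = Primary(backup)
print(primary.handle_update(("x", 42)))   # OK
assert backup.state == primary.state      # the backup is never behind an acknowledged write
```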

* This blog was last updated on 2023-06-06 22:10