DDIA Ch01: Reliable, Scalable, and Maintainable Applications

Hi!

Let's read Chapter 01, Reliable, Scalable, and Maintainable Applications, of Designing Data-Intensive Applications.

It introduces the terminology and approach that we are going to use throughout the book, and it also explores some fundamental ways of thinking about data-intensive applications: general properties (nonfunctional requirements) such as reliability, scalability, and maintainability.

First of all, there are 2 types of applications:

  • compute-intensive applications: raw CPU power is a limiting factor
  • data-intensive applications: the bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

And many applications today are data-intensive; they are typically built from standard building blocks that provide commonly needed functionality:

  • Databases
  • Caches
  • Search Indexes
  • Stream Processing
  • Batch Processing

In reality, however, it can be hard to combine these tools when building an application.

1.1 Thinking About Data Systems

In this section, we talk about the background of data systems.

Data systems can all store data for some time, but they have very different access patterns, which means different performance characteristics and thus very different implementations.

In recent years, as new tools for data processing and storage have emerged, the boundaries between the traditional categories have become blurred. And when different tools are stitched together by application code, the work is broken down into tasks that can each be performed efficiently by a single tool.

However, a lot of tricky questions arise when designing a data system or service. And in this book, we mainly focus on 3 concerns that are important in most software systems: Reliability, Scalability, and Maintainability.

1.2 Reliability

In this section, we deal with the kinds of faults that can be tolerated, such as hardware faults, software errors, and human errors.

First of all, reliability means that the system should continue to work correctly, even in the face of adversity.

However, things can and will go wrong, so it only makes sense to talk about tolerating certain types of faults, i.e., preventing faults from causing failures. (A fault is one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service.)

In practice, we generally prefer tolerating faults over preventing faults, and by deliberately inducing faults, we ensure that the fault-tolerant machinery is continually exercised and tested.
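For example, here is a toy sketch of fault injection in the spirit of Netflix's Chaos Monkey, which the book cites as an example of deliberately inducing faults. The function names and failure rate below are made up for illustration:

  import random

  def chaotic(fn, failure_rate=0.2):
      """Wrap fn so it fails randomly, simulating a flaky component."""
      def wrapper(*args, **kwargs):
          if random.random() < failure_rate:
              raise ConnectionError("injected fault")
          return fn(*args, **kwargs)
      return wrapper

  @chaotic
  def fetch_profile(user_id):
      return {"id": user_id, "name": "example"}

  # The calling code must tolerate the injected faults, e.g. by retrying:
  for attempt in range(3):
      try:
          print(fetch_profile(42))
          break
      except ConnectionError:
          print("fault injected, retrying...")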

1.2.1 Hardware Faults

Hardware faults happen randomly; components such as hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years.

Hardware faults are only weakly correlated, so one component's failure is roughly independent of another's.
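As a rough back-of-the-envelope sketch (assuming a cluster of 10,000 disks and the 10 to 50 year MTTF range mentioned in the book), independent random failures add up quickly at scale:

  # Back-of-the-envelope estimate: the figures (10,000 disks, 10-50 year MTTF)
  # are illustrative assumptions, not measurements.
  DISKS = 10_000

  for mttf_years in (10, 30, 50):
      mttf_days = mttf_years * 365
      expected_failures_per_day = DISKS / mttf_days
      print(f"MTTF {mttf_years} years -> ~{expected_failures_per_day:.1f} disk failures per day")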

Solutions for tolerating faults (rather than preventing them):

  • add hardware redundancy
  • use software fault-tolerance techniques

1.2.2 Software Errors

Software Errors are systematic errors within the system.

Software errors are strongly correlated across nodes: a bug that affects one node is likely to affect many nodes at the same time.

Solutions:

  • carefully thinking about assumptions and interactions in the system.
  • thorough testing
  • process isolation
  • allowing process(es) to crash and restart
  • measuring, monitoring, and analyzing system behavior in production

1.2.3 Human Errors

Human errors are caused by human operators, who are known to be unreliable even when they have the best intentions.

Approaches:

  • minimize opportunities for error when designing systems
  • use sandbox environments to decouple the places where people make mistakes from the places where mistakes can cause outages
  • test thoroughly, from unit tests to whole-system integration tests and manual tests
  • allow quick and easy recovery from human errors
  • detailed and clear monitoring, e.g., telemetry
  • good management practices and training

1.3 Scalability

In this section, we focus on scalability - the ability of a system to cope with increased load.

1.3.1 Describing 'Load'

Load can be described with a few numbers, called load parameters (e.g., requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users).

The best choice of parameters depends on the architecture of the system.
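For instance, in the book's Twitter example, the load parameter that matters depends on whether home timelines are assembled at read time or fanned out at write time. A rough sketch, using approximately the figures quoted in the book (the average fan-out of ~75 followers per tweet is an approximation):

  # Which load parameter matters depends on the architecture.
  tweets_per_sec = 4_600            # tweets posted per second (average)
  timeline_reads_per_sec = 300_000  # home timeline reads per second
  avg_fan_out = 75                  # average number of followers per tweet (approximate)

  # Approach 1: compute each home timeline when it is read -> reads are the load.
  print("fan-out on read:  timeline queries/sec =", timeline_reads_per_sec)

  # Approach 2: push each new tweet into followers' timeline caches on write ->
  # the write load is multiplied by the fan-out.
  print("fan-out on write: timeline inserts/sec =", tweets_per_sec * avg_fan_out)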

1.3.2 Describing 'Performance'

We use performance numbers to investigate what happens when load increases.

And since response time is a distribution of measured values rather than a single number, we describe it with percentiles (e.g., p999 means that 99.9% of requests are handled faster than that particular threshold).
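A minimal sketch of computing percentiles from measured response times, using the nearest-rank method (the sample values below are made up for illustration):

  import math

  def percentile(samples, p):
      """Nearest-rank percentile: the value below which p% of samples fall."""
      ordered = sorted(samples)
      rank = math.ceil(p / 100 * len(ordered))
      return ordered[max(rank - 1, 0)]

  # Hypothetical response-time measurements in milliseconds:
  response_times_ms = [12, 14, 15, 15, 16, 18, 21, 25, 40, 150]
  for p in (50, 90, 99):
      print(f"p{p}: {percentile(response_times_ms, p)} ms")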

However, reducing response times at very high percentiles (known as tail latencies) may be too expensive, and may be difficult due to random events outside your control.

Queueing delays often account for a large part of the response time at high percentiles, for the following reasons:

  • head-of-line blocking: a server can only process a small number of requests in parallel, so a few slow requests hold up the processing of subsequent requests in the queue.
  • tail latency amplification: when serving a single end-user request requires multiple backend calls, just one slow backend call can slow down the entire end-user request (see the sketch after this list).
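A quick back-of-the-envelope sketch of tail latency amplification (the 1% slow-call rate is an assumed figure):

  # If serving one end-user request requires n backend calls, and each backend
  # call is slow with probability p, the chance that the user-facing request is
  # slowed down by at least one slow call is 1 - (1 - p)**n.
  p_slow = 0.01  # assume 1% of backend calls exceed the latency threshold

  for n in (1, 10, 100):
      p_user_slow = 1 - (1 - p_slow) ** n
      print(f"{n:>3} backend calls -> {p_user_slow:.1%} of user requests affected")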

1.3.3 Coping with Load

In this part, we talk about how to maintain good performance, even when load parameters increase.

  • Rethink the architecture on every order-of-magnitude increase in load.
  • Use a mixture of 2 scaling approaches
    • scaling up, or vertical scaling: moving to a more powerful machine
    • scaling out, or horizontal scaling: distributing the load across multiple machines, also known as a shared-nothing architecture
  • When choosing load parameters, figure out which operations will be common and which will be rare.
  • Use an elastic system to add computing resources automatically if load is highly unpredictable; manually scaled systems are simpler and may have fewer operational surprises (a toy scaling rule is sketched below).
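As an illustration of the elastic approach, a toy threshold-based scaling rule (the per-machine capacity is an assumed figure; real autoscalers use smoothing, cooldowns, and safety margins):

  def desired_machines(requests_per_sec, capacity_per_machine=1_000):
      """Scale out so that the measured load fits within per-machine capacity."""
      needed = -(-requests_per_sec // capacity_per_machine)  # ceiling division
      return max(1, needed)

  for load in (800, 4_200, 12_000):
      print(f"{load} req/s -> {desired_machines(load)} machine(s)")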

1.4 Maintainability

The majority of the cost of software is in its ongoing maintenance, so software should be designed to minimize pain during maintenance and thus avoid creating legacy software.

And in this section, we pay attention to 3 design principles for software systems: operability, simplicity, and evolvability.

1.4.1 Operability

Operability means making it easy for operations teams to keep the system running smoothly.

A data system should provide good operability, which means making routine tasks easy and allowing the operations team to focus their efforts on high-value activities.

1.4.2 Simplicity

Simplicity means making it easy for new engineers to understand the system.

We use abstraction to remove accidental complexity, which is not inherent in the problem that software solves (as seen by users) but arises only from the implementation.

And our goal is to use good abstractions to extract parts of a large system into well-defined, reusable components.

1.4.3 Evolvability

Evolvability means making it easy for engineers to make changes to the system in the future, adapting it to unanticipated use cases as requirements change.

In terms of organizational processes, Agile working patterns provide a framework for adapting to change. And the Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.

And in this book, we will use evolvability to refer to agility on a data system level.


* This blog was last updated on 2022-05-19 22:06