Chapter 20 Metrics Monitoring and Alerting System
In this chapter, we explore the design of a scalable metrics monitoring and alerting system. A well-designed monitoring and alerting system plays a key role in providing clear visibility into the health of the infrastructure to ensure high availability and reliability.
Figure 1 shows some of the most popular metrics monitoring and alerting services in the marketplace. In this chapter, we design a similar service that can be used internally by a large company.

Step 1 - Understand the Problem and Establish Design Scope
A metrics monitoring and alerting system can mean many different things to different companies, so it is essential to nail down the exact requirements first with the interviewer. For example, you do not want to design a system that focuses on logs such as web server error or access logs if the interviewer has only infrastructure metrics in mind.
Let’s first fully understand the problem and establish the scope of the design before diving into the details.
Candidate: Who are we building the system for? Are we building an in-house system for a large corporation like Facebook or Google, or are we designing a SaaS service like Datadog [1], Splunk [2], etc? Interviewer: That’s a great question. We are building it for internal use only.
Candidate: Which metrics do we want to collect? Interviewer: We want to collect operational system metrics. These can be low-level usage data of the operating system, such as CPU load, memory usage, and disk space consumption. They can also be high-level concepts such as requests per second of a service or the running server count of a web pool. Business metrics are not in the scope of this design.
Candidate: What is the scale of the infrastructure we are monitoring with this system? Interviewer: 100 million daily active users, 1,000 server pools, and 100 machines per pool.
Candidate: How long should we keep the data? Interviewer: Let's assume we want 1-year retention.
Candidate: May we reduce the resolution of the metrics data for long-term storage? Interviewer: That's a great question. We would like to keep newly received data for 7 days. After 7 days, you may roll it up to a 1-minute resolution for 30 days. After 30 days, you may further roll it up to a 1-hour resolution.
Candidate: What are the supported alert channels? Interviewer: Email, phone, PagerDuty, or webhooks (HTTP endpoints).
Candidate: Do we need to collect logs, such as error log or access log? Interviewer: No.
Candidate: Do we need to support distributed system tracing? Interviewer: No.
High-level requirements and assumptions
Now you have finished gathering requirements from the interviewer and have a clear scope of the design. The requirements are:
- The infrastructure being monitored is large-scale.
- 100 million daily active users
- Assume we have 1,000 server pools, 100 machines per pool, 100 metrics per machine => ~10 million metrics
- 1-year data retention
- Data retention policy: raw form for 7 days, 1-minute resolution for 30 days, 1-hour resolution for 1 year.
- A variety of metrics can be monitored, for example:
- CPU usage
- Request count
- Memory usage
- Message count in message queues
Non-functional requirements
- Scalability. The system should be scalable to accommodate growing metrics and alert volume.
- Low latency. The system needs to have low query latency for dashboards and alerts.
- Reliability. The system should be highly reliable to avoid missing critical alerts.
- Flexibility. Technology keeps changing, so the pipeline should be flexible enough to easily integrate new technologies in the future.
Which requirements are out of scope?
- Log monitoring. The Elasticsearch, Logstash, Kibana (ELK) stack is very popular for collecting and monitoring logs [3].
- Distributed system tracing [4] [5]. Distributed tracing refers to a tracing solution that tracks service requests as they flow through distributed systems. It collects data as requests go from one service to another.
Step 2 - Propose High-Level Design and Get Buy-In
In this section, we discuss some fundamentals of building the system, the data model, and the high-level design.
Fundamentals
A metrics monitoring and alerting system generally contains five components, as illustrated in Figure 2.

Figure 2 Five components of the system
- Data collection: collect metric data from different sources.
- Data transmission: transfer data from sources to the metrics monitoring system.
- Data storage: organize and store incoming data.
- Alerting: analyze incoming data, detect anomalies, and generate alerts. The system must be able to send alerts to different communication channels.
- Visualization: present data in graphs, charts, etc. Engineers are better at identifying patterns, trends, or problems when data is presented visually, so we need visualization functionality.
Data model
Metrics data is usually recorded as a time-series, which contains a set of values with timestamps. The series can be identified by name and an optional set of tags.
Example 1 - What is the CPU load on production server instance i631 at 20:00?

The data can be identified by the following table:

The time series is identified by the metric name, labels, and a single point at a specific time.
Example 2 - What is the average CPU load across all web servers in the us-west region for the last 10min?
CPU.load host=webserver01,region=us-west 1613707265 50
CPU.load host=webserver01,region=us-west 1613707265 62
CPU.load host=webserver02,region=us-west 1613707265 43
CPU.load host=webserver02,region=us-west 1613707265 53
...
CPU.load host=webserver01,region=us-west 1613707265 76
CPU.load host=webserver01,region=us-west 1613707265 83
This is example data we might pull from storage to answer that question. The average CPU load can be calculated by averaging the values in the last column of the rows.
The format shown above is called the line protocol. It is used by many popular monitoring systems on the market, e.g., Prometheus and OpenTSDB.
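To make the line protocol concrete, here is a minimal Python sketch (not tied to any particular library) that parses samples like the ones above and averages their values; the exact field layout is an assumption for illustration.
# Minimal sketch: parse line-protocol samples and average their values.
# Assumed layout: "<metric> <tag1>=<v1>,<tag2>=<v2> <unix_timestamp> <value>".
def parse_line(line: str):
    metric, tags_str, ts, value = line.split()
    tags = dict(pair.split("=") for pair in tags_str.split(","))
    return metric, tags, int(ts), float(value)

samples = [
    "CPU.load host=webserver01,region=us-west 1613707265 50",
    "CPU.load host=webserver02,region=us-west 1613707265 43",
]
values = [parse_line(s)[3] for s in samples]
print(sum(values) / len(values))  # average CPU load across the returned rows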
What every time series consists of:

A good way to visualize what the data looks like:

- The x-axis is time.
- The y-axis is the dimension you're querying, e.g., metric name, tag, etc.
The data access pattern is write-heavy with spiky reads: we constantly collect a lot of metrics, but they are read infrequently, and reads come in bursts, e.g., during ongoing incidents.
The data storage system is the heart of this design.
- It is not recommended to use a general-purpose database for this problem, although you could achieve good scale with expert-level tuning.
- Using a NoSQL database can work in theory, but it is hard to devise a scalable schema for effectively storing and querying time-series data.
There are many databases, specifically tailored for storing time-series data. Many of them support custom query interfaces which allow for effective querying of time-series data.
- OpenTSDB is a distributed time-series database, but it is based on Hadoop and HBase. If you don't have that infrastructure already provisioned, this technology is hard to adopt.
- Twitter uses MetricsDB, while Amazon offers Timestream.
- The two most popular time-series databases are InfluxDB and Prometheus.
- They are designed to store large volumes of time-series data. Both of them are based on in-memory cache + on-disk storage.
Example scale of InfluxDB - more than 250k writes per second when provisioned with 8 cores and 32 GB RAM:

You are not expected to understand the internals of a metrics database, as that is niche knowledge. You might be asked about it only if you've mentioned it on your resume.
For the purposes of the interview, it is sufficient to understand that metrics are time-series data and to be aware of popular time-series databases, like InfluxDB.
One nice feature of time-series databases is the efficient aggregation and analysis of large amounts of time-series data by labels. InfluxDB, for example, builds indexes for each label.
It is critical, however, to keep the cardinality of labels low, i.e., avoid labels with too many unique values.
High-level Design

- Metrics source - can be application servers, SQL databases, message queues, etc.
- Metrics collector - Gathers metrics data and writes to time-series database
- Time-series database - stores metrics as time-series. Provides a custom query interface for analyzing large amounts of metrics.
- Query service - Makes it easy to query and retrieve data from the time-series DB. Could be replaced entirely by the DB's interface if it's sufficiently powerful.
- Alerting system - Sends alert notifications to various alerting destinations.
- Visualization system - Shows metrics in the form of graphs/charts.
Step 3 - Design Deep Dive
Let's deep dive into several of the more interesting parts of the system.
Metrics collection
For metrics collection, occasional data loss is not critical. It's acceptable for clients to fire and forget.

There are two ways to implement metrics collection - pull or push.
Here's what the pull model might look like:

For this solution, the metrics collector needs to maintain an up-to-date list of services and metrics endpoints. We can use Zookeeper or etcd for that purpose - service discovery.
Service discovery contains configuration rules about when and where to collect metrics from:

Here's a detailed explanation of the metrics collection flow:

- The metrics collector fetches configuration metadata from service discovery. This includes the pulling interval, IP addresses, and timeout & retry parameters.
- The metrics collector pulls metrics data via a pre-defined HTTP endpoint (e.g. /metrics), which is typically exposed by a client library on the service side (see the sketch after this list).
- Alternatively, the metrics collector can register a change event notification with the service discovery component to be notified when a service endpoint changes.
- Another option is for the metrics collector to periodically poll for metrics endpoint configuration changes.
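Below is a hypothetical sketch of the pull loop, assuming service discovery has already handed us a list of targets that each expose a /metrics endpoint; the target addresses, interval, and error handling are illustrative assumptions, not a specific implementation.
# Hypothetical pull-based collector loop. Targets come from service discovery;
# each target exposes a pre-defined /metrics endpoint.
import time
import urllib.request

targets = ["http://10.0.0.1:9100", "http://10.0.0.2:9100"]  # from service discovery
PULL_INTERVAL_SECONDS = 10

def scrape(target: str) -> str:
    # Pull the /metrics endpoint with a timeout.
    with urllib.request.urlopen(f"{target}/metrics", timeout=5) as resp:
        return resp.read().decode()

while True:
    for target in targets:
        try:
            payload = scrape(target)
            # ...parse the payload and forward it to the ingestion pipeline...
        except OSError:
            pass  # fire and forget: occasional data loss is acceptable here
    time.sleep(PULL_INTERVAL_SECONDS)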
At our scale, a single metrics collector is not enough. There must be multiple instances. However, there must also be some kind of synchronization among them so that two collectors don't collect the same metrics twice.
One solution for this is to position collectors and servers on a consistent hash ring and associate a set of servers with a single collector only:

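As a rough illustration of the idea, here is a minimal consistent-hash sketch, assuming MD5 as the hash function and made-up collector names; a production ring would also add virtual nodes for better balance.
# Illustrative consistent-hash ring that maps each server to exactly one collector.
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, collectors):
        self._ring = sorted((ring_hash(c), c) for c in collectors)
        self._keys = [h for h, _ in self._ring]

    def collector_for(self, server: str) -> str:
        # Walk clockwise to the first collector at or after the server's hash.
        idx = bisect.bisect(self._keys, ring_hash(server)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["collector-1", "collector-2", "collector-3"])
print(ring.collector_for("webserver01"))  # always the same collector for this server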
With the push model, on the other hand, services push their metrics to the metrics collector proactively:

In this approach, typically a collection agent is installed alongside service instances.
The agent collects metrics from the server and pushes them to the metrics collector.

With this model, we can potentially aggregate metrics before sending them to the collector, which reduces the volume of data processed by the collector.
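A minimal sketch of such an agent is shown below, assuming a hypothetical HTTP push endpoint on the collector and a simple one-minute counter window; the payload shape and endpoint URL are assumptions for illustration.
# Sketch of a collection agent that aggregates a request counter for one minute,
# then pushes a single data point to the (hypothetical) collector endpoint.
import json
import time
import urllib.request

COLLECTOR_URL = "http://metrics-collector.internal/api/v1/push"  # assumed endpoint
WINDOW_SECONDS = 60
request_count = 0  # incremented by the application, e.g. once per handled request

def flush(count: int):
    point = {"metric": "requests.count", "host": "webserver01",
             "timestamp": int(time.time()), "value": count}
    req = urllib.request.Request(COLLECTOR_URL, data=json.dumps(point).encode(),
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass  # fire and forget: losing an occasional point is acceptable

while True:
    time.sleep(WINDOW_SECONDS)
    flush(request_count)
    request_count = 0  # start a new aggregation window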
On the flip side, the metrics collector can reject push requests when it can't handle the load. Hence, it is important to put the collector in an auto-scaling group behind a load balancer.
So which one is better? There are trade-offs between both approaches, and different systems use different approaches:
- Prometheus uses a pull architecture
- Amazon CloudWatch and Graphite use a push architecture
Here are some of the main differences between push and pull:
| | Pull | Push |
|---|---|---|
| Easy debugging | The /metrics endpoint on application servers used for pulling metrics can be used to view metrics at any time. You can even do this on your laptop. Pull wins. | If the metrics collector doesn’t receive metrics, the problem might be caused by network issues. |
| Health check | If an application server doesn’t respond to the pull, you can quickly figure out if an application server is down. Pull wins. | If the metrics collector doesn’t receive metrics, the problem might be caused by network issues. |
| Short-lived jobs | Some batch jobs might be short-lived and don’t last long enough to be pulled. This can be fixed by introducing push gateways for the pull model [22]. | Push wins. |
| Firewall or complicated network setups | Having servers pulling metrics requires all metric endpoints to be reachable. This is potentially problematic in multiple data center setups. It might require a more elaborate network infrastructure. | If the metrics collector is set up with a load balancer and an auto-scaling group, it is possible to receive data from anywhere. Push wins. |
| Performance | Pull methods typically use TCP. | Push methods typically use UDP. This means the push method provides lower-latency transports of metrics. The counterargument here is that the effort of establishing a TCP connection is small compared to sending the metrics payload. |
| Data authenticity | Application servers to collect metrics from are defined in config files in advance. Metrics gathered from those servers are guaranteed to be authentic. | Any kind of client can push metrics to the metrics collector. This can be fixed by whitelisting servers from which to accept metrics, or by requiring authentication. |
There is no clear winner. A large organization probably needs to support both, not least because there might be components where installing a push agent is not possible in the first place.
Scale the metrics transmission pipeline

The metrics collector is provisioned in an auto-scaling group, regardless of whether we use the push or pull model.
There is a chance of data loss if the time-series DB is down, however. To mitigate this, we'll provision a queuing mechanism:

- Metrics collectors push metrics data into Kafka
- Consumers or stream processing services such as Apache Storm, Flink or Spark process the data and push it to the time-series DB
This approach has several advantages:
- Kafka is used as a highly-reliable and scalable distributed message platform
- It decouples data collection and data processing from one another
- It can prevent data loss by retaining the data in Kafka
Kafka can be configured with one partition per metric name, so that consumers can aggregate data by metric names.
To scale this, we can further partition by tags/labels and categorize/prioritize metrics to be collected first.

The main downside of using Kafka for this problem is the maintenance and operational overhead. An alternative is to use a large-scale ingestion system like Gorilla [24]; it can be argued that this would be just as scalable as using Kafka for queuing.
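To make the partitioning idea above concrete, here is a small sketch of choosing a partition from the metric name so that all points of one metric land in the same partition; the partition count and hash function are assumptions, and a real producer would normally delegate this to the Kafka client's keyed partitioner.
# Illustrative partition selection keyed by metric name.
import hashlib

NUM_PARTITIONS = 64  # assumed topic configuration

def partition_for(metric_name: str) -> int:
    digest = hashlib.md5(metric_name.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for("CPU.load"))      # always the same partition for CPU.load
print(partition_for("memory.usage"))  # usually a different partition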
Where aggregations can happen
Metrics can be aggregated at several places. There are trade-offs between different choices:
- Collection agent - a client-side collection agent only supports simple aggregation logic, e.g., collect a counter for 1 minute and send it to the metrics collector.
- Ingestion pipeline - To aggregate data before writing to the DB, we need a stream processing engine like Flink. This reduces write volume, but we lose data precision as we don't store raw data.
- Query side - We can aggregate data when we run queries via our visualization system. There is no data loss, but queries can be slow due to a lot of data processing.
Query Service
Having a query service separate from the time-series DB decouples the visualization and alerting systems from the database, which lets us swap the database at will.
We can add a Cache layer here to reduce the load to the time-series database:

We can also avoid adding a query service altogether, as most visualization and alerting systems have powerful plugins that integrate with most time-series databases. With a well-chosen time-series DB, we might not need to introduce our own caching layer either.
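If we do build our own query service, its cached read path could look roughly like the sketch below; the cache key layout, TTL, and the query_tsdb() helper are illustrative assumptions (a real deployment would likely use Redis or a similar cache server instead of an in-process dict).
# Sketch of a query-service read path with a cache in front of the time-series DB.
import time

CACHE_TTL_SECONDS = 60
_cache = {}  # (metric, start, end) -> (expires_at, result)

def query_tsdb(metric: str, start: int, end: int):
    ...  # placeholder: issue the actual query against the time-series database

def query(metric: str, start: int, end: int):
    key = (metric, start, end)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]  # cache hit: no load on the time-series database
    result = query_tsdb(metric, start, end)
    _cache[key] = (time.time() + CACHE_TTL_SECONDS, result)
    return result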
Most time-series DBs don't support SQL, simply because it is ineffective for querying time-series data. Here's an example SQL query for computing a moving average:
select id,
temp,
avg(temp) over (partition by group_nr order by time_read) as rolling_avg
from (
select id,
temp,
time_read,
interval_group,
id - row_number() over (partition by interval_group order by time_read) as group_nr
from (
select id,
time_read,
'epoch'::timestamp + '900 seconds'::interval * (extract(epoch from time_read)::int4 / 900) as interval_group,
temp
from readings
) t1
) t2
order by time_read;
Here's a similar query in Flux, the query language used in InfluxDB:
from(db:"telegraf")
|> range(start:-1h)
|> filter(fn: (r) => r._measurement == "foo")
|> exponentialMovingAverage(size:-10s)
Storage layer
It is important to choose the time-series database carefully.
According to research published by Facebook, ~85% of queries to the operational store were for data from the past 26h.
If we choose a database that harnesses this property, it could have a significant impact on system performance. InfluxDB is one such option.
Regardless of the database we choose, there are some optimizations we might employ.
Data encoding and compression can significantly reduce the size of data. Those features are usually built into a good time-series database.

In the above example, instead of storing full timestamps, we can store timestamp deltas.
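As a quick illustration of delta encoding (the timestamps below are made up), the first timestamp is stored once and only the small differences between consecutive points are kept:
# Minimal sketch of delta-encoding timestamps.
timestamps = [1610087371, 1610087381, 1610087391, 1610087401, 1610087411]

base = timestamps[0]
deltas = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
print(base, deltas)  # 1610087371 [10, 10, 10, 10] -- far fewer bits than full timestamps

# Decoding reverses the process.
decoded = [base]
for d in deltas:
    decoded.append(decoded[-1] + d)
assert decoded == timestamps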
Another technique we can employ is down-sampling - converting high-resolution data to low-resolution in order to reduce disk usage.
We can use it for old data and make the rules configurable by data scientists, e.g.:
- 7d - no down-sampling
- 30d - down-sample to 1min
- 1y - down-sample to 1h
For example, here's a 10-second resolution metrics table:
| metric | timestamp | hostname | metric_value |
|---|---|---|---|
| cpu | 2021-10-24T19:00:00Z | host-a | 10 |
| cpu | 2021-10-24T19:00:10Z | host-a | 16 |
| cpu | 2021-10-24T19:00:20Z | host-a | 20 |
| cpu | 2021-10-24T19:00:30Z | host-a | 30 |
| cpu | 2021-10-24T19:00:40Z | host-a | 20 |
| cpu | 2021-10-24T19:00:50Z | host-a | 30 |
down-sampled to 30-second resolution:
| metric | timestamp | hostname | metric_value (avg) |
|---|---|---|---|
| cpu | 2021-10-24T19:00:00Z | host-a | 15.3 |
| cpu | 2021-10-24T19:00:30Z | host-a | 26.7 |
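A minimal sketch of that roll-up, using the six rows above and bucketing by 30-second windows of the epoch timestamp (the rounding to one decimal is a presentational choice):
# Down-sample the 10-second data above to 30-second averages.
from collections import defaultdict
from datetime import datetime, timezone

rows = [  # (timestamp, value) for metric "cpu" on host-a
    ("2021-10-24T19:00:00Z", 10), ("2021-10-24T19:00:10Z", 16), ("2021-10-24T19:00:20Z", 20),
    ("2021-10-24T19:00:30Z", 30), ("2021-10-24T19:00:40Z", 20), ("2021-10-24T19:00:50Z", 30),
]

buckets = defaultdict(list)
for ts, value in rows:
    epoch = int(datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc).timestamp())
    buckets[epoch // 30 * 30].append(value)

for bucket_start, values in sorted(buckets.items()):
    avg = round(sum(values) / len(values), 1)
    print(datetime.fromtimestamp(bucket_start, tz=timezone.utc).isoformat(), avg)  # 15.3, then 26.7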
Finally, we can also move old data that is rarely accessed to cold storage. The financial cost of cold storage is much lower.
Alerting system

Alert configuration is loaded into cache servers. Rules are typically defined in YAML format. Here's an example:
- name: instance_down
rules:
# Alert for any instance that is unreachable for >5 minutes.
- alert: instance_down
expr: up == 0
for: 5m
labels:
severity: page
The alert manager fetches alert configurations from cache. Based on configuration rules, it also calls the query service at a predefined interval. If a rule is met, an alert event is created.
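A rough sketch of that evaluation loop for the instance_down rule above might look as follows; query_service() and create_alert_event() are placeholders for the real integrations, and a full implementation would also clear pending entries once an instance recovers.
# Sketch of the alert manager's evaluation loop for the instance_down rule.
import time

EVAL_INTERVAL_SECONDS = 60
FOR_DURATION_SECONDS = 5 * 60       # the "for: 5m" clause in the rule
pending_since = {}                  # instance -> first time "up == 0" was observed

def query_service(expr: str) -> list:
    return []  # placeholder: return the instances for which the expression holds

def create_alert_event(instance: str) -> None:
    ...  # placeholder: dedupe, persist to the alert store, publish to Kafka

while True:
    now = time.time()
    for instance in query_service("up == 0"):
        pending_since.setdefault(instance, now)
        if now - pending_since[instance] >= FOR_DURATION_SECONDS:
            create_alert_event(instance)  # unreachable for >5 minutes: fire the alert
    time.sleep(EVAL_INTERVAL_SECONDS)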
Other responsibilities of the alert manager are:
- Filtering, merging, and deduplicating alerts, e.g., if an alert for a single instance is triggered multiple times, only one alert event is generated.
- Access control - it is important to restrict alert-management operations to certain individuals only
- Retry - the manager ensures that the alert is propagated at least once.
The alert store is a key-value database, like Cassandra, which keeps the state of all alerts. It ensures a notification is sent at least once. Once an alert is triggered, it is published to Kafka.
Finally, alert consumers pull alerts data from Kafka and send notifications over to different channels - Email, text message, PagerDuty, webhooks.
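For the webhook channel, a consumer could be as simple as the sketch below; the webhook URL and payload shape are assumptions, and the Kafka consumption step is elided.
# Sketch of an alert consumer forwarding an alert event to a webhook (HTTP endpoint).
import json
import urllib.request

WEBHOOK_URL = "https://alerts.example.internal/hook"  # hypothetical endpoint

def notify(alert_event: dict) -> bool:
    req = urllib.request.Request(WEBHOOK_URL, data=json.dumps(alert_event).encode(),
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
        return True
    except OSError:
        return False  # signal failure so delivery can be retried (at-least-once)

notify({"alert": "instance_down", "severity": "page", "instance": "i631"})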
In the real world, there are many off-the-shelf alerting solutions. It is difficult to justify building your own system in-house.
Visualization system
The visualization system shows metrics and alerts over a time period. Here's a dashboard built with Grafana:

A high-quality visualization system is very hard to build. It is hard to justify not using an off-the-shelf solution like Grafana.
Step 4 - Wrap up
Here's our final design:

Chapter Summary

Reference Materials
[1] Datadog: https://www.datadoghq.com/
[2] Splunk: https://www.splunk.com/
[3] Elastic stack: https://www.elastic.co/elastic-stack
[4] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure: https://research.google/pubs/pub36356/
[5] Distributed Systems Tracing with Zipkin: https://blog.twitter.com/engineering/en_us/a/2012/distributed-systems-tracing-with-zipkin.html
[6] Prometheus: https://prometheus.io/docs/introduction/overview/
[7] OpenTSDB - A Distributed, Scalable Monitoring System: http://opentsdb.net/
[8] Data model: https://prometheus.io/docs/concepts/data_model/
[9] Schema design for time-series data | Cloud Bigtable Documentation: https://cloud.google.com/bigtable/docs/schema-design-time-series
[10] MetricsDB: TimeSeries Database for storing metrics at Twitter: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb.html
[11] Amazon Timestream: https://aws.amazon.com/timestream/
[12] DB-Engines Ranking of time-series DBMS: https://db-engines.com/en/ranking/time+series+dbms
[13] InfluxDB: https://www.influxdata.com/
[14] etcd: https://etcd.io
[15] Service Discovery with Zookeeper: https://cloud.spring.io/spring-cloud-zookeeper/1.2.x/multi/multi_spring-cloud-zookeeper-discovery.html
[16] Amazon CloudWatch: https://aws.amazon.com/cloudwatch/
[17] Graphite: https://graphiteapp.org/
[18] Push vs Pull: http://bit.ly/3aJEPxE
[19] Pull doesn’t scale - or does it?: https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
[20] Monitoring Architecture: https://developer.lightbend.com/guides/monitoring-at-scale/monitoring-architecture/architecture.html
[21] Push vs Pull in Monitoring Systems: https://giedrius.blog/2019/05/11/push-vs-pull-in-monitoring-systems/
[22] Pushgateway: https://github.com/prometheus/pushgateway
[23] Building Applications with Serverless Architectures: https://aws.amazon.com/lambda/serverless-architectures-learn-more/
[24] Gorilla: A Fast, Scalable, In-Memory Time Series Database: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
[25] Why We’re Building Flux, a New Data Scripting and Query Language: https://www.influxdata.com/blog/why-were-building-flux-a-new-data-scripting-and-query-language/
[26] InfluxDB storage engine: https://docs.influxdata.com/influxdb/v2.0/reference/internals/storage-engine/
[27] YAML: https://en.wikipedia.org/wiki/YAML
[28] Grafana Demo: https://play.grafana.org/