CHAPTER 15
Monitoring and Logging
Updated: May 15, 2026
1. Introduction
You have successfully architected a globally distributed system consisting of an API Gateway, 50 microservices, 3 databases, a Redis cache, and an Elasticsearch cluster. At 3:00 AM on a Saturday, the CEO calls you. The checkout system is failing, and the company is losing $10,000 a minute. You open your laptop. Which of the 50 microservices is broken? Is it a database lock, a network timeout, or a memory leak? If you do not have world-class Monitoring and Logging, you are flying blind in a hurricane, and the company will fail before you find the bug. In this chapter, we will master the discipline of Observability. We will engineer structured logs, deploy Distributed Tracing to track requests across server fleets, and build alerting dashboards to fix disasters before the users even notice them.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define the difference between Logging, Metrics, and Distributed Tracing.
- Implement "Structured Logging" (JSON format) for machine-readable analysis.
- Architect a Centralized Logging pipeline (the ELK stack).
- Understand how Correlation IDs enable Distributed Tracing across microservices.
- Design actionable Alerts and prevent "Alert Fatigue."
3. The Three Pillars of Observability
A robust system requires three distinct lenses to see into the architecture.
- 1. Logs (The Story): An immutable, time-stamped record of discrete events that happened over time (e.g., "User 123 logged in at 10:04 AM", "Database query failed").
- 2. Metrics (The Vital Signs): Numerical data measured over intervals of time (e.g., "CPU utilization is at 85%", "API latency is 200ms", "Error rate is 2%"). Metrics tell you *that* a system is failing; logs tell you *why* it is failing.
- 3. Tracing (The Journey): A specialized tool for microservices that tracks the lifecycle of a single user request as it travels through multiple different servers.
4. Structured Logging (The JSON Revolution)
The old way of logging was writing sentences:
[INFO] User 123 purchased item 456 for $20.
- The Problem: When you have 10 billion logs, and you want to search for "All purchases over $50," a computer cannot easily read that English sentence.
- Structured Logging: You must log events as structured JSON objects.
{"level": "INFO", "event": "purchase", "user_id": 123, "item": 456, "amount": 20.00}
- The Benefit: Now, your logging platform can parse the JSON, allowing you to run powerful mathematical queries instantly across billions of records.
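The idea above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a production logger: the `JsonFormatter` class, the `build_logger` and `purchases_over` helpers, the `checkout` logger name, and the convention of passing data through `extra={"fields": ...}` are all assumptions made for this example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Hypothetical convention: structured data arrives via extra={"fields": {...}}
        # and is merged in as top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Create a logger whose only handler writes JSON lines to `stream`."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def purchases_over(lines, threshold):
    """Answer "all purchases over X", a query that is awkward on free text."""
    events = (json.loads(line) for line in lines)
    return [e for e in events if e.get("event") == "purchase" and e.get("amount", 0) > threshold]


if __name__ == "__main__":
    log = build_logger()
    log.info("purchase", extra={"fields": {"user_id": 123, "item": 456, "amount": 20.00}})
```

Because every line is a self-describing object, a filter like `purchases_over` is a field comparison rather than a regular expression over English sentences. A logging platform applies the same principle at index scale.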
5. Centralized Logging (The ELK Stack)
In a distributed system, you cannot SSH into 50 different servers to read 50 different text files.
- The Centralization Pipeline: You must aggregate all logs into a single, massive, searchable database.
- The ELK Stack (Industry Standard):
- E (Elasticsearch): The search engine that indexes the billions of log lines.
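- L (Logstash, or an alternative such as Fluentd): The collection layer that receives, parses, and constantly ships the logs to Elasticsearch. In practice, a lightweight shipper (for example Filebeat or Fluent Bit) often runs on every server and feeds this layer.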
- K (Kibana): The visual dashboard UI where engineers type search queries and view graphs.
- *(Note: Cloud alternatives like Datadog or Splunk are also massively popular).*
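To make the shipping step concrete, here is a rough sketch of what a shipper does before sending a batch: it wraps each log line in the newline-delimited format that Elasticsearch's bulk indexing endpoint expects (an action line followed by the document). The `to_bulk_payload` function and the `app-logs` index name are hypothetical, the code only builds the request body and does not send it, and real deployments normally rely on an existing agent rather than hand-written code.

```python
import json
from typing import Iterable


def to_bulk_payload(log_lines: Iterable[str], index: str = "app-logs") -> str:
    """Convert JSON log lines into a newline-delimited bulk request body."""
    out = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            # Keep unparseable lines as raw text instead of dropping them silently.
            doc = {"level": "UNKNOWN", "raw": line}
        # Each document is preceded by an action line naming the target index.
        out.append(json.dumps({"index": {"_index": index}}))
        out.append(json.dumps(doc))
    # The bulk format requires a newline after the final line.
    return ("\n".join(out) + "\n") if out else ""


if __name__ == "__main__":
    print(to_bulk_payload(['{"level": "INFO", "event": "purchase", "amount": 20.0}']), end="")
```

Batching many lines into one request is the main design point: sending one HTTP request per log line would add needless load at the volumes described above.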
6. Distributed Tracing (Correlation IDs)
If a user clicks "Checkout", the request hits the Gateway, the Order Service, the Payment Service, and the Inventory Service. If the Payment Service crashes, how do you string the logs together?
- The Correlation ID: When the request hits the API Gateway, the gateway generates a unique random string (e.g., req_abc123).
- The Relay: The Gateway passes req_abc123 to the Order Service in the HTTP Header. The Order Service passes it to the Payment Service.
- The Log: Every single microservice includes "correlation_id": "req_abc123" in its logs.
- The Magic: When investigating the crash, you search Kibana for req_abc123. You instantly see the exact path, latency, and failure point of that specific transaction mapped out across 4 different servers.
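The generate, relay, and log steps above can be sketched in a few lines of Python. This is a simplified, in-process simulation under stated assumptions: the `X-Correlation-ID` header name is a common convention rather than something the chapter prescribes, the function names are hypothetical, headers are modeled as plain dictionaries (so lookups are case-sensitive, unlike real HTTP), and a list stands in for the central log store.

```python
import contextvars
import json
import uuid

# Assumed header name; teams choose their own convention.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


def accept_request(headers: dict) -> str:
    """Reuse the incoming ID, or mint one if this service is the entry point."""
    cid = headers.get(CORRELATION_HEADER) or f"req_{uuid.uuid4().hex[:12]}"
    _correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to attach to every downstream call made for this request."""
    return {CORRELATION_HEADER: _correlation_id.get()}


def log_event(service: str, event: str, sink: list) -> None:
    """Append a JSON log line that carries the current correlation ID."""
    sink.append(json.dumps({
        "service": service,
        "event": event,
        "correlation_id": _correlation_id.get(),
    }))


def trace(sink: list, cid: str) -> list:
    """Return every log entry for one ID, in the order it was written."""
    return [e for e in map(json.loads, sink) if e["correlation_id"] == cid]


if __name__ == "__main__":
    logs = []
    cid = accept_request({})                 # gateway: no header yet, so mint an ID
    log_event("gateway", "checkout_received", logs)
    accept_request(outgoing_headers())       # order service: reuses the relayed ID
    log_event("order", "order_created", logs)
    for entry in trace(logs, cid):
        print(entry)
```

The key design choice is that only the entry point generates an ID; every later service must reuse what it receives, otherwise the chain breaks at that hop.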
7. Diagrams/Visual Suggestions
*Architecture Diagram: The Observability Pipeline*
8. Best Practices
- Alert Fatigue: If you configure your system to page the on-call engineer every time CPU hits 70%, the engineer's phone will ring 50 times a day. After a week, they will start ignoring the alerts entirely (Alert Fatigue). *Best Practice:* Only trigger emergency pages for high-level business impact (e.g., "Checkout success rate dropped below 95%"). Let the auto-scaling groups handle the CPU spikes.
9. Common Mistakes
- Logging Sensitive PII: A developer lazily logs the entire HTTP request payload for debugging: log.info(request.body). *The Failure:* The body contains the user's plain-text password and unencrypted credit card number. The logs are shipped to a central server accessible by hundreds of employees. This is a catastrophic, illegal data breach (GDPR/PCI violation). *The Fix:* Implement strict log scrubbers that mask password and credit_card fields before they ever leave the server.
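A log scrubber of the kind described in the fix could look like the sketch below. It assumes the payload has already been parsed into Python dictionaries and lists; the `scrub` function, the `SENSITIVE_KEYS` set, and the mask string are hypothetical names chosen for illustration, and a real deployment would need a longer, reviewed list of sensitive fields.

```python
# Field names taken from the example above; a real list would be longer.
SENSITIVE_KEYS = {"password", "credit_card"}
MASK = "***REDACTED***"


def scrub(payload):
    """Return a copy of `payload` with sensitive fields masked at any nesting depth."""
    if isinstance(payload, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else scrub(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [scrub(item) for item in payload]
    # Scalars (strings, numbers, None) are returned unchanged.
    return payload


if __name__ == "__main__":
    body = {
        "user": "ann",
        "password": "hunter2",
        "payment": {"credit_card": "4111111111111111", "amount": 20.0},
    }
    print(scrub(body))
```

The function builds a new structure instead of mutating its input, so the application can keep using the original request while only the masked copy reaches the logger. Note that key-based masking does not catch secrets embedded inside free-text values.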
10. Mini Project: Build an Incident Response Dashboard
Let's build the screens the engineering team watches on Black Friday.
- 1. The Infrastructure View: A Grafana dashboard pulling metrics from Prometheus. It shows 5 dials: Total Requests per Second (RPS), Average API Latency (ms), CPU Load, Database Connections, and HTTP 500 Error Rate.
- 2. The Business View: A dashboard showing live Revenue per Minute, Active Shopping Carts, and Checkout Failure Rate.
- 3. The Alert Logic: We set an automated trigger: "If HTTP 500 Error Rate > 1% for 3 consecutive minutes, trigger a PagerDuty alert to wake up the Lead Engineer."
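The alert rule in step 3 can be expressed as a small state machine. The sketch below is plain Python for illustration only: the `ErrorRateAlert` class and its method are hypothetical, it assumes one observation per minute, and in a real setup this condition would typically live in the monitoring system's own alerting rules rather than in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the per-minute error rate exceeds a threshold for N minutes in a row."""

    def __init__(self, threshold: float = 0.01, consecutive_minutes: int = 3):
        self.threshold = threshold
        # Remembers only the most recent N breach/no-breach results.
        self.window = deque(maxlen=consecutive_minutes)

    def observe_minute(self, total_requests: int, http_500s: int) -> bool:
        """Record one minute of traffic; return True if the alert should fire."""
        rate = http_500s / total_requests if total_requests else 0.0
        self.window.append(rate > self.threshold)
        return len(self.window) == self.window.maxlen and all(self.window)


if __name__ == "__main__":
    alert = ErrorRateAlert()
    for total, errors in [(1000, 20), (1000, 20), (1000, 20)]:
        if alert.observe_minute(total, errors):
            print("page the on-call engineer")
```

Requiring several consecutive breaches is what keeps a single noisy minute from paging anyone, which ties this rule back to the Alert Fatigue guidance in the Best Practices section.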
11. Practice Exercises
- 1. Define the "Three Pillars of Observability" (Logs, Metrics, Tracing). Explain how they complement each other when diagnosing a complex system crash.
- 2. Explain the necessity of "Structured Logging" (JSON) over traditional plain-text logging when dealing with billions of events in an enterprise system.
12. MCQs with Answers
Question 1
In a microservices architecture, a single user action (like "Checkout") may require 5 different independent servers to communicate with each other. To successfully debug a failure, engineers must be able to track this single request as it jumps from server to server. What specific architectural mechanism makes this tracking possible?
Question 2
When an engineering team configures overly sensitive alarms (e.g., triggering a pager every time a minor background task fails), the engineers eventually become desensitized and begin ignoring the alarms, potentially missing a catastrophic system crash. What is this psychological operational failure called?
13. Interview Questions
- Q: Explain the mechanical architecture of a Centralized Logging Pipeline (such as the ELK stack or Datadog). Why is it an absolute requirement to ship logs off the local application servers immediately? (Hint: What happens if the server's hard drive fails?)
- Q: Walk me through the catastrophic security implications of logging full HTTP payloads. How do you architect log sanitization pipelines to ensure compliance with privacy laws like GDPR?
- Q: Compare "Metrics" to "Logs." If a system is experiencing a slow memory leak, which of these two observability pillars is best suited to alert the team to the problem *before* the server crashes?