CHAPTER 15
Monitoring and Logging in Azure
Updated: May 15, 2026
1. Introduction
Deploying architecture into the cloud is only the first step of Cloud Engineering. Once an application is live in production, your primary job shifts to Observability. If your Virtual Machine crashes at 3:00 AM, how do you know it happened? Why did it happen? To answer these questions, Microsoft provides Azure Monitor, a comprehensive suite of tools that includes Log Analytics and Application Insights. In this chapter, we will learn how to capture application logs, visualize infrastructure metrics, and configure proactive alerting.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define the difference between Logs (Events) and Metrics (Numbers).
- Utilize Log Analytics and the Kusto Query Language (KQL) to search logs.
- Create visual Dashboards using Azure Monitor.
- Understand Application Insights for deep code-level tracing.
- Configure Alert Rules to page engineers via email or SMS.
3. Beginner-Friendly Explanation
Imagine managing a massive factory complex.
- Log Analytics (The Security Cameras): Every time a door opens, a machine starts, or an error occurs, a detailed note is written in a master journal. If a machine breaks down, you can read the journal to see exactly what buttons the operator pressed right before the crash.
- Azure Monitor Metrics (The Dials and Alarms): A massive control room filled with dials showing the current temperature and speed of every machine (Metrics). You set a rule: "If the temperature dial on Machine A goes over 100 degrees for more than 5 minutes, instantly sound a loud alarm (Alert Rule) to wake up the manager."
4. Metrics vs. Logs
- Metrics: Numerical values collected at regular intervals. (e.g., "VM CPU is at 85%", "Database DTU is at 90%"). They are incredibly lightweight, fast to query, and perfect for triggering alerts.
- Logs: Text records of events that occurred. (e.g., "User Alice failed to log in at 10:04 AM due to incorrect password"). They contain deep context for troubleshooting but are heavier to query.
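The contrast can be sketched with two small Python data shapes. This is an illustrative model only, not how Azure Monitor stores data; the class names `MetricSample` and `LogRecord`, the helper functions, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MetricSample:
    """A metric: one number per timestamp, collected at a regular interval."""
    timestamp: datetime
    name: str
    value: float


@dataclass(frozen=True)
class LogRecord:
    """A log: a free-form event record carrying troubleshooting context."""
    timestamp: datetime
    source: str
    message: str


def breaches(sample: MetricSample, threshold: float) -> bool:
    # Checking a metric is a single numeric comparison.
    return sample.value > threshold


def search(records: list[LogRecord], keyword: str) -> list[LogRecord]:
    # Checking logs means scanning text, which costs more per record.
    return [r for r in records if keyword.lower() in r.message.lower()]


cpu = MetricSample(datetime(2026, 5, 15, 10, 4), "Percentage CPU", 85.0)
logs = [
    LogRecord(datetime(2026, 5, 15, 10, 4), "auth",
              "User Alice failed to log in: incorrect password"),
    LogRecord(datetime(2026, 5, 15, 10, 5), "auth", "User Bob logged in"),
]

print(breaches(cpu, 90.0))          # False: 85% is below the 90% threshold
print(len(search(logs, "failed")))  # 1: only Alice's record matches
```

The numeric check is why metrics suit alerting, while the text search is what you reach for once an alert has told you where to look.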
5. Log Analytics Workspace and KQL
To store all these logs, you create a central Log Analytics Workspace. You configure all your VMs, App Services, and Databases to stream their logs directly into this massive, searchable bucket. To search the bucket, you use Kusto Query Language (KQL), a powerful query language designed by Microsoft for fast log searching. It is often compared to SQL, but a KQL query is written as a pipeline of operators separated by `|`.
*Example KQL Query:* "Show me the top 5 errors from the Web Server in the last hour."
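```kusto
// Illustrative sketch: assumes App Service HTTP logs are sent to the workspace.
// Adjust the table and column names to match the logs you actually collect.
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where ScStatus >= 500
| summarize ErrorCount = count() by CsUriStem, ScStatus
| top 5 by ErrorCount desc
```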
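A "top 5 errors in the last hour" query follows a filter, count, rank pattern. For readers more comfortable with general-purpose code, here is a minimal Python sketch of that pattern over in-memory records. The record fields (`timestamp`, `level`, `message`) and the sample data are hypothetical, and this is not how Log Analytics actually executes a query.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_errors(records, now, window=timedelta(hours=1), limit=5):
    """Return the `limit` most frequent error messages seen within `window` of `now`."""
    cutoff = now - window
    counts = Counter(
        r["message"]
        for r in records
        if r["level"] == "Error" and r["timestamp"] > cutoff  # filter step
    )                                                         # count-per-message step
    return counts.most_common(limit)                          # rank step


now = datetime(2026, 5, 15, 12, 0)
records = [
    {"timestamp": now - timedelta(minutes=5), "level": "Error", "message": "Timeout"},
    {"timestamp": now - timedelta(minutes=10), "level": "Error", "message": "Timeout"},
    {"timestamp": now - timedelta(minutes=20), "level": "Error", "message": "NullReference"},
    {"timestamp": now - timedelta(minutes=30), "level": "Info", "message": "Started"},
    {"timestamp": now - timedelta(hours=3), "level": "Error", "message": "DiskFull"},
]

print(top_errors(records, now))  # [('Timeout', 2), ('NullReference', 1)]
```

The `Info` record and the three-hour-old `DiskFull` record are dropped by the filter before anything is counted, which is the same reason a time filter belongs near the start of a log query.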
6. Alert Rules and Action Groups
A Dashboard is useless if nobody is looking at it. You need proactive alerts.
- Alert Rule: You define a condition: "If VM CPU exceeds 90% for 5 consecutive minutes, trigger an Alert."
- Action Group: What happens when the alert triggers? You define an Action Group that tells Azure: "Send an Email to the DevOps team, send an SMS text message to the On-Call Manager, and trigger an Azure Function to attempt to restart the server."
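The split between the condition (Alert Rule) and the response (Action Group) can be sketched in plain Python. This is a simplified model under stated assumptions: the classes `AlertRule` and `ActionGroup`, the notification functions, and the per-minute sample list are all hypothetical, and Azure's real evaluation (aggregation type, window size, evaluation frequency) is configurable and not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AlertRule:
    """Fires when each of the last `minutes` one-minute readings exceeds `threshold`."""
    threshold: float
    minutes: int

    def is_firing(self, samples: list[float]) -> bool:
        recent = samples[-self.minutes:]
        return len(recent) == self.minutes and all(v > self.threshold for v in recent)


@dataclass
class ActionGroup:
    """A named list of notification actions run when an alert fires."""
    name: str
    actions: list[Callable[[str], str]] = field(default_factory=list)

    def notify(self, message: str) -> list[str]:
        return [action(message) for action in self.actions]


def send_email(message: str) -> str:
    # Stand-in for a real email integration.
    return f"EMAIL to devops-team: {message}"


def send_sms(message: str) -> str:
    # Stand-in for a real SMS integration.
    return f"SMS to on-call manager: {message}"


rule = AlertRule(threshold=90.0, minutes=5)
group = ActionGroup("DevOps-Email-Alerts", [send_email, send_sms])

cpu_per_minute = [40.0, 95.0, 97.0, 99.0, 96.0, 98.0]
if rule.is_firing(cpu_per_minute):
    for line in group.notify("VM CPU over 90% for 5 minutes"):
        print(line)
```

Keeping the two objects separate mirrors the design in Azure: one Action Group can be reused by many Alert Rules, so changing who gets paged does not require editing every rule.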
7. Mini Project: Monitor VM Health
Let's create an alarm system for our virtual machine.
Step-by-Step Tutorial: *(Assumption: You have a running Azure VM from Chapter 4).*
- 1. In the Azure Portal, search for Monitor.
- 2. On the left menu, click Alerts, then click + Create > Alert rule.
- 3. Select a resource: Find and select your Virtual Machine (`my-first-webserver`). Click Apply.
- 4. Condition: Search for and select the signal name Percentage CPU.
- 5. Scroll down to Alert logic:
  - Operator: `Greater than`
  - Threshold value: `90` (meaning 90%).
- 6. Click Next: Actions.
- 7. Click + Create action group.
  - Name: `DevOps-Email-Alerts`
  - Short name: `DevOpsAlert`
- 8. Click the Notifications tab. Notification type: `Email/SMS message`. Select Email and type your personal email address. Click OK, then click Review + Create to create the Action Group.
- 9. Click Next: Details.
- 10. Alert rule name: `CRITICAL: VM CPU OVER 90%`.
- 11. Click Review + create, then Create.
- 12. *The Result:* You now have a working alarm system. If your Virtual Machine is hijacked by a cryptominer and its CPU stays above 90%, you will receive an email within minutes so you can log in, investigate, and shut down the VM if necessary.
8. Real-World Scenarios
A DevOps team manages a fleet of 50 App Service instances. Instead of checking each server manually, they stream all metrics into a central Azure Monitor Dashboard displayed on a TV in their office. Suddenly, an Alert Rule triggers an automated message in their Microsoft Teams channel: "Database Connection Pool Exhausted." Because of the proactive alert, an engineer scales up the database 10 minutes before the web application completely crashes, saving the company from an outage.
9. Best Practices
- Application Insights: If you want to know exactly *why* your C# or Node.js code is slow, enable Application Insights. It instruments your application, through an SDK or an auto-instrumentation agent, to record incoming HTTP requests and outgoing dependency calls such as database queries. Telemetry may be sampled under heavy load, so not every call is necessarily stored. It can also generate an "Application Map", a visual flowchart showing which microservice is causing the bottleneck.
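To give a feel for what this kind of instrumentation records, here is a conceptual Python sketch using a timing decorator. It is not the Application Insights SDK: the `traced` decorator, the `telemetry` list, and both sample functions are hypothetical stand-ins for what a real agent does automatically.

```python
import time
from functools import wraps

telemetry = []  # in-memory stand-in for a telemetry backend


def traced(kind):
    """Record the name, kind, duration, and outcome of each call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            success = False
            try:
                result = func(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs whether the call returned or raised.
                telemetry.append({
                    "name": func.__name__,
                    "kind": kind,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    "success": success,
                })
        return wrapper
    return decorator


@traced("dependency")
def query_database():
    time.sleep(0.05)  # simulate a slow SQL call
    return ["row"]


@traced("request")
def handle_request():
    return {"status": 200, "rows": query_database()}


handle_request()
for t in telemetry:
    print(f"{t['kind']:<10} {t['name']:<16} {t['duration_ms']:.1f} ms")
```

Comparing the request's duration with the durations of its dependency calls shows where the time went, which is the same information an Application Map presents visually.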
10. Common Mistakes
- Alert Fatigue: Beginners often set their alerting thresholds too low (e.g., "Alert me if CPU hits 50%"). This causes the system to send 50 emails a day for completely normal traffic fluctuations. Engineers suffer "Alert Fatigue" and begin ignoring the emails. When a real 100% CPU crash happens, they ignore that email too. Only alert on actionable emergencies.
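The effect of a threshold that is too low can be shown with a small Python simulation. The CPU readings below are invented for illustration, and `count_alerts` is a hypothetical helper, not an Azure API.

```python
def count_alerts(samples, threshold, sustained=1):
    """Count alerts: one each time `samples` stays above `threshold`
    for `sustained` consecutive readings (a new alert needs a new breach)."""
    alerts = 0
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run == sustained:
                alerts += 1
        else:
            run = 0
    return alerts


# Hypothetical CPU readings (%), one per 5 minutes: normal bursts plus one real incident.
cpu = [35, 55, 40, 60, 45, 52, 30, 58, 42, 95, 97, 99, 98, 41, 56, 38]

print(count_alerts(cpu, threshold=50))               # 6: five are routine bursts
print(count_alerts(cpu, threshold=90))               # 1: only the real incident
print(count_alerts(cpu, threshold=90, sustained=3))  # 1: and only once it persists
```

Raising the threshold and requiring the breach to persist both cut the noise, leaving a single alert that corresponds to the one event worth waking someone up for.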
11. CLI Examples
To view the metrics of a specific VM via the Azure CLI:
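```bash
# Illustrative sketch: <subscription-id> and <resource-group> are placeholders to fill in.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/virtualMachines/my-first-webserver" \
  --metric "Percentage CPU" \
  --interval PT1M \
  --aggregation Average \
  --output table
```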
12. Exercises
- 1. Explain the fundamental difference between a Log and a Metric. Which is better suited for triggering a rapid alert?
- 2. Why is "Alert Fatigue" considered a critical operational risk, and how can it be mitigated?
13. FAQs
Q: Do I have to pay for Azure Monitor? A: Basic metrics (like VM CPU and Network traffic) are stored for 93 days for free. However, if you dump gigabytes of text logs into a Log Analytics Workspace, you pay per GB ingested and stored.
14. Interview Questions
- Q: Describe the architectural flow of establishing long-term compliance retention for infrastructure logs using Azure Log Analytics and Azure Storage Archives.
- Q: A production web application experiences intermittent 500 Internal Server Errors. Describe your step-by-step methodology utilizing Azure Monitor, Kusto Query Language (KQL), and Application Insights to isolate the root cause.