Azure Application Insights: Monitoring, KQL Queries and Observability in Production
DEV Community Grade 10 10h ago

At 2am on a Tuesday, an IP address change in Microsoft infrastructure silently broke our entire integration pipeline at Blue Yonder. Messages stopped flowing between Salesforce and ServiceNow. No errors surfaced in the application logs. The systems reported themselves as healthy. But nothing was moving.

It was Azure Application Insights that found it. The Application Map showed a 100% failure rate on the Service Bus node. I clicked the connection line, saw connection refused errors on the dependency calls, and traced the root cause to the IP change within minutes. I wrote a Logic App to inspect the dead-letter queue and replay every failed message. Zero data loss. Zero SLA breach.

That incident permanently shaped how I think about observability. App Insights is not a monitoring tool you add at the end. It is the foundation you build everything on from day one.

This post covers what I learned using App Insights in production - the queries, the alerts, the patterns, and the lessons that only come from real incidents.

What App Insights Collects Automatically

The first thing that surprises most engineers is how much App Insights collects with almost no configuration. Add one NuGet package and one line in Program.cs and you immediately get:

- Every HTTP request your API receives - the URL, method, response code, duration, and whether it succeeded.
- Every outbound dependency call your app makes - SQL queries, HTTP calls to external APIs, Service Bus operations, Blob Storage reads. Each one is tracked with duration and success status.
- Every unhandled exception with the full stack trace, the exact line of code that failed, and all exception properties including inner exceptions.
- Log messages you write with ILogger, captured with their custom properties. By default the App Insights logging provider forwards Warning and above, so lower the minimum level in your logging configuration if you also want Information.
- Performance counters including CPU usage, memory consumption, and request queue length, collected automatically on App Service.

Setting It Up in C# ASP.NET Core

Three steps and you are done.
1. Install the NuGet package: Microsoft.ApplicationInsights.AspNetCore
2. Add one line in Program.cs: builder.Services.AddApplicationInsightsTelemetry(options => options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"]);
3. Add the connection string to appsettings.json or Azure App Service Configuration (never hardcode it).

That is it. Your ILogger calls now flow to App Insights, with no code changes needed in your controllers or services.

For custom events and metrics - tracking business-level events beyond technical telemetry - inject TelemetryClient and call TrackEvent() with a name and custom properties. I used this at Blue Yonder to track every integration completion with the system name and record count as properties, so I could query processing volumes by system over time.

KQL - The Query Language That Changes Everything

KQL (Kusto Query Language) is what makes App Insights powerful rather than just a log viewer. It reads left to right with pipe operators - each step filters or transforms the result of the previous step.

The basic structure is always the same: start with a table name, then pipe through operators like where, project, summarize, order by, and extend.

Once you understand five queries you can write almost any investigation query you need. Here are the five I used most in production.

All failed requests in the last hour:

    requests
    | where timestamp > ago(1h)
    | where success == false
    | project timestamp, name, url, resultCode, duration
    | order by timestamp desc

This was my first query every morning and after every deployment. If it returned rows, I had work to do.

Slowest API endpoints today:

    requests
    | where timestamp > ago(24h)
    | summarize AvgDuration = avg(duration), Count = count() by name
    | order by AvgDuration desc
    | take 10

This identified performance regressions right after deployments, before customers reported them.
All exceptions with full detail:

    exceptions
    | where timestamp > ago(4h)
    | project timestamp, type, outerMessage, innermostMessage, method
    | order by timestamp desc

The innermostMessage field is the one you want - it has the root cause, not the wrapper exception.

Dependency failures - what external calls failed:

    dependencies
    | where timestamp > ago(1h)
    | where success == false
    | project timestamp, type, target, name, duration, resultCode
    | order by timestamp desc

This was how I found the Microsoft IP change. Service Bus dependencies all showing connection refused, all starting at the same timestamp.

Error rate by hour - spot patterns:

    requests
    | where timestamp > ago(24h)
    | summarize Total=count(), Failed=countif(success==false) by bin(timestamp,1h)
    | extend ErrorRate = round(Failed*100.0/Total, 2)
    | order by timestamp asc

This chart pattern shows you whether failures are random noise or a systematic problem getting worse.

The Queries I Wrote at Blue Yonder

Beyond the standard queries, I built several patterns specific to integration monitoring that I have not seen documented elsewhere.

Cross-system correlation - following one record through every system it touched. Every request in our pipeline carried a CorrelationId custom property.
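How that CorrelationId property gets attached is not shown in this post, so the following is only a minimal sketch under assumptions: the class name CaseSyncService, the event name IntegrationCompleted, and the property names System and RecordCount are hypothetical. It uses TelemetryClient.TrackEvent for the custom event and an ILogger scope for log entries. With the App Insights logging provider's default scope handling, the scope values should appear in customDimensions next to the message.

```csharp
using System.Collections.Generic;
using System.Globalization;
using Microsoft.ApplicationInsights;
using Microsoft.Extensions.Logging;

// Hypothetical service; the names are illustrative, not from the original pipeline.
public sealed class CaseSyncService
{
    private readonly TelemetryClient _telemetry;
    private readonly ILogger<CaseSyncService> _logger;

    // Both dependencies are resolved by ASP.NET Core dependency injection
    // once AddApplicationInsightsTelemetry has been called.
    public CaseSyncService(TelemetryClient telemetry, ILogger<CaseSyncService> logger)
    {
        _telemetry = telemetry;
        _logger = logger;
    }

    // Builds the custom properties that end up in customDimensions.
    public static Dictionary<string, string> BuildProperties(
        string correlationId, string system, int recordCount) =>
        new Dictionary<string, string>
        {
            ["CorrelationId"] = correlationId,
            ["System"] = system,
            ["RecordCount"] = recordCount.ToString(CultureInfo.InvariantCulture),
        };

    public void RecordCompletion(string correlationId, string system, int recordCount)
    {
        // Log entries written inside the scope carry the CorrelationId.
        // Information-level entries only reach App Insights if the provider's
        // minimum level has been lowered from its Warning default.
        var scope = new Dictionary<string, object> { ["CorrelationId"] = correlationId };
        using (_logger.BeginScope(scope))
        {
            _logger.LogInformation(
                "Integration completed for {System} with {RecordCount} records",
                system, recordCount);
        }

        // Custom event, queryable from the customEvents table.
        _telemetry.TrackEvent(
            "IntegrationCompleted",
            BuildProperties(correlationId, system, recordCount));
    }
}
```

Keeping the property names in one helper means every event and query agrees on the exact key, which matters because customDimensions lookups in KQL are case-sensitive.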
With this query I could trace a single Salesforce case through Logic App orchestration, Function App transformation, Service Bus messaging, and the final ServiceNow API call - seeing exact timestamps and durations at each step:

    union requests, dependencies, traces, exceptions
    | where timestamp > ago(24h)
    | where tostring(customDimensions["CorrelationId"]) == "your-id"
    | project timestamp, itemType, name, message, duration, success
    | order by timestamp asc

Token refresh monitoring - tracking the 3-month Salesforce and 6-month ServiceNow credential refresh cycles that were a significant operational risk before I centralized them in Key Vault:

    customEvents
    | where name == "TokenRefreshed"
    | extend System = tostring(customDimensions["System"])
    | summarize LastRefresh=max(timestamp), SuccessCount=countif(tobool(customDimensions["Success"])==true) by System

Integration pipeline health - the Monday morning query that showed overnight processing status for every integration flow:

    requests
    | where timestamp > ago(12h)
    | where name contains "Integration"
    | summarize Success=countif(success==true), Failed=countif(success==false), AvgDuration=avg(duration) by name
    | extend Status = iff(Failed > 0, "DEGRADED", "HEALTHY")
    | order by Status asc

Smart Alerts - Stop Watching Dashboards

The real power of App Insights is alerts that find problems for you.

Metric alerts fire when a number crosses a threshold - failed requests greater than 5 in 5 minutes, response time average greater than 2 seconds, exception count greater than 10 per hour.

Log alerts run a KQL query on a schedule and fire if the results meet a condition. This is how I monitored the dead-letter queue - a query that ran every 5 minutes and fired immediately if any messages appeared in the DLQ. In production, a non-empty DLQ is always a signal that something needs investigation.

Smart detection requires no configuration.
App Insights learns your baseline automatically and alerts on anomalies - unusual failure rate spikes, abnormal response time degradation, memory leak patterns. It caught two issues at Blue Yonder that I would not have noticed from metrics alone.

Live Metrics During Deployments

Live Metrics shows you what is happening with less than one second latency - incoming requests per second, exception rate, dependency call rate, CPU and memory of every running instance.

I had Live Metrics open on a second monitor during every production deployment. If the exception rate spiked within 30 seconds of a deploy, I knew immediately to roll back. If it stayed flat for 2 minutes, the deployment was clean.

This practice caught one bad deployment at Blue Yonder that would have caused a production incident if we had waited for customer reports. The rollback took 90 seconds. The alternative would have been hours of incident response.

Application Map

The Application Map is the fastest way to understand what broke and where. It shows every component of your system as a node - your API, the SQL database, Service Bus, external APIs - with connection lines showing call volume and failure rate between them.

When the Microsoft IP change broke our pipeline, the Service Bus node on the Application Map turned red with a 100% failure rate. I clicked the connection line between our API and Service Bus. The details pane showed connection refused with the target IP address. That one click saved 30 minutes of log digging.

Distributed Tracing

Every request in App Insights gets a unique Operation ID that flows automatically through every system it touches. A single user request can pass through APIM, a Logic App, a Function App, Service Bus, and SQL, all under the same Operation ID.

In Transaction Search, paste the Operation ID and see every step in chronological order with exact timestamps and durations. The full story of one request across your entire distributed system in one view.
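In .NET, this kind of correlation is typically carried by System.Diagnostics.Activity. As a rough illustration only (no App Insights calls are made, and the activity names are hypothetical), the sketch below shows the underlying mechanism: a child activity started while a parent is current inherits the parent's trace ID. Assuming the SDK's default W3C correlation, that trace ID is the value that surfaces as the Operation ID.

```csharp
using System;
using System.Diagnostics;

// W3C is already the default ID format on .NET 5+; it is set here only to
// make the assumption explicit.
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

// Parent activity stands in for an incoming HTTP request.
using var request = new Activity("HandleRequest").Start();

// Child activity stands in for an outbound dependency call. It is started
// while the parent is Activity.Current, so it inherits the parent's trace ID
// and records the parent's span ID as its own parent.
using var dependency = new Activity("SendToServiceBus").Start();

Console.WriteLine($"Request    trace ID: {request.TraceId}");
Console.WriteLine($"Dependency trace ID: {dependency.TraceId}");
Console.WriteLine($"Same operation: {request.TraceId == dependency.TraceId}");
```

The trace IDs are random per run, but the two printed values match each other, which is what lets one search return every step of a request.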
The practical implication for code: add a CorrelationId to your custom log entries and custom events. Then you can find every log entry related to one business transaction even when it crosses system boundaries.

Connecting App Insights Across the Azure Stack

App Insights works with every Azure service. Logic Apps through diagnostic settings send all run history to the same Log Analytics workspace you query with KQL. Function Apps with AddApplicationInsightsTelemetry() track every function execution automatically. APIM connected to App Insights logs every API call with the caller's subscription key. Service Bus diagnostic settings expose message counts and DLQ depth as metrics.

The goal is one workspace where you can query across all these data sources simultaneously. When an incident spans multiple systems - which in integration work they always do - you want one place to look, not five dashboards.

Key Lessons From Production

Set up App Insights before you write business logi
