Monitoring Dataverse Query Throttling with Application Insights (Preview)

Hemanth Kumar (D365 FastTrack)

Audience: admins, makers, SRE/ops, and integration teams using Dataverse at scale.

Introduction

Query throttling is how Dataverse protects the platform under heavy load. Until now, it’s been hard to know when throttling happens and why. The new Query Throttling diagnostics (preview) change that by emitting rich telemetry to Application Insights whenever throttling occurs. With a few KQL queries and a dashboard, customers can investigate spikes, pinpoint noisy queries, and tune workloads—without opening a support ticket.

What the telemetry includes

For each throttled event, Application Insights emits a record with fields such as:
  • Command (throttled query text)
  • CdsQueryHash (stable hash for grouping similar queries)
  • Reason (why throttled)
  • Delay (ms) and Probability (%)
  • Throttle expiry time
  • Context like operation/session IDs, environment/organization identifiers

Note: names can differ slightly by release; use the helper below to normalize.
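
Because names can vary, a quick way to see exactly what your environment emits is to inspect the raw customDimensions on a few recent records. A minimal sketch, assuming the standard Application Insights tables (traces and customEvents):
// Inspect raw records to confirm the exact field names your release emits
union isfuzzy=true traces, customEvents
| where timestamp > ago(1d)
| where tostring(customDimensions) contains "throttl"
| project timestamp, itemType, name, customDimensions
| take 20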

Prerequisites

  • Application Insights connected to your Dataverse environment.
  • Permission to run KQL in Azure Data Explorer, Log Analytics, or Workbooks.

Setting up Data Export to Application Insights from Power Platform Admin Center

Power Platform admins can use the Power Platform Admin Center to enable Data Export to Application Insights, which turns on the Dataverse throttling telemetry. Note that if you have already set up a Dataverse diagnostics data export, you will need to delete it and create a new data export to Application Insights.
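
Once the export is enabled, a quick sanity check confirms data is flowing into the Application Insights resource. A hedged sketch using the standard Application Insights tables:
// Confirm the Dataverse data export is flowing into this resource
union isfuzzy=true traces, customEvents, requests, dependencies
| where timestamp > ago(1h)
| summarize Records = count() by itemType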

FastTrack Plug and Play dashboard:  Dynamics-365-FastTrack-Implementation-Assets/MonitoringAndTelemetry/DataverseThrottling at master · microsoft/Dynamics-365-FastTrack-Implementation-Assets

Quick start: KQL helper

Add this as the base query in your Azure Data Explorer workbook. It unions traces and customEvents, then exposes the common fields used by the visuals.
// ------ Global parameters ------
let _lookback = 7d;     // time window for analysis
let _bin      = 1h;     // chart bin (10m/30m/1h)

// ------ Helper: DataverseThrottle ------
let DataverseThrottle = () {
    union isfuzzy=true
    (
        traces
        | where timestamp > ago(_lookback)
        | where tostring(customDimensions.throttlingAction) =~ "Throttle"
            or tostring(customDimensions.ThrottleAction) =~ "Throttle"
    ),
    (
        customEvents
        | where timestamp > ago(_lookback)
        | where name =~ "QueryThrottled"
            or tostring(customDimensions.throttlingAction) =~ "Throttle"
    )
    | extend cd = todynamic(customDimensions)
    | extend
        command       = tostring(cd.Command),
        cdsQueryHash  = tostring(cd.CdsQueryHash),
        delayMs       = tolong(cd.throttlingDelayMilliseconds),
        probability   = todouble(cd.throttleProbabilityPercentage),
        reason        = tostring(cd.throttleReason),
        expiryTime    = todatetime(cd.throttleExpiryTime),
        environmentId = tostring(cd.environmentId),
        organization  = tostring(cd.organizationUrl),
        userId        = tostring(user_Id),
        opId          = tostring(operation_Id),
        opName        = tostring(operation_Name),
        sessionId     = tostring(session_Id)
    | project timestamp, command, cdsQueryHash, delayMs, probability, reason, expiryTime,
              environmentId, organization, userId, opId, opName, sessionId, itemType
};
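
KQL let statements apply only within a single query block, so keep the helper and any query that uses it together. A quick smoke test that the helper returns normalized rows:
// Smoke test: run together with the helper definition above
DataverseThrottle
| summarize Events = count(), Oldest = min(timestamp), Newest = max(timestamp)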

Explore the data: ready-to-paste queries

1) Throttles over time (with average delay)

Purpose: Show the volume and severity of throttling over time so you can spot patterns and regressions and verify improvements after changes.
How it works: Counts throttle events per time bin and computes the average throttle delay within each bin (the bin defaults to 1h; change it as appropriate).
Interpretation: Rising Throttles = more frequent throttling; rising AvgDelayMs = more impactful throttling. Both should decline after query optimizations.

Use a time chart (line/area) and plot both series.
DataverseThrottle
| summarize Throttles = count(), AvgDelayMs = avg(delayMs) by bin(timestamp, 1h)
| order by timestamp asc
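
If you run this directly in Log Analytics or Azure Data Explorer rather than a workbook, a variant that reuses the _bin parameter and renders the chart inline might look like this (a sketch; adjust to your tooling):
DataverseThrottle
| summarize Throttles = count(), AvgDelayMs = avg(delayMs) by bin(timestamp, _bin)
| order by timestamp asc
| render timechart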

2) Top throttled queries (by hash) + sample text

Purpose: Identify which logical queries cause the most throttling so you can prioritize remediation.
How it works: Groups by a stable query hash and shows total throttles, average delay, and one representative command text.
Interpretation: Start with the top hashes; high AvgDelayMs indicates heavier impact per event.

Start remediation here.
DataverseThrottle
| summarize Throttles = count(),
          AvgDelayMs = round(avg(delayMs), 1),
          AnySampleCommand = take_any(command)
  by cdsQueryHash
| top 15 by Throttles desc
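
Once a suspect hash surfaces, a drill-down on just that hash helps confirm the pattern and, later, verify the fix. A sketch (the hash value below is a placeholder; paste one from the table above):
// Drill into one query hash over time (placeholder value below)
let _hash = "<paste a cdsQueryHash from the top-15 results>";
DataverseThrottle
| where cdsQueryHash == _hash
| summarize Throttles = count(), AvgDelayMs = avg(delayMs) by bin(timestamp, 1h)
| order by timestamp asc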

3) Reasons for throttling

Purpose: Explain why throttling occurred to guide the right fix (e.g., concurrency vs. cost).
How it works: Aggregates the reason field and ranks by count.
Interpretation: Dominant reasons point to the governing policy or resource under pressure.

Distribution of “why”.
DataverseThrottle
| summarize Throttles = count() by tostring(reason)
| top 10 by Throttles desc

4) Delay percentiles (impact)

Purpose: Quantify how much throttling slows requests for typical users and the long tail.
How it works: Computes min/max and percentiles of the applied delay across all throttle events in the window.
Interpretation: High P95/P99 means a subset of requests are significantly delayed; track these post-remediation.

Single-value tiles or a compact table.
DataverseThrottle
| summarize
    MinDelayMs = min(delayMs),
    P50DelayMs = percentile(delayMs, 50),
    P90DelayMs = percentile(delayMs, 90),
    P95DelayMs = percentile(delayMs, 95),
    P99DelayMs = percentile(delayMs, 99),
    MaxDelayMs = max(delayMs)

5) Hot windows (spike detection)

Purpose: Alert on statistically significant spikes so teams can respond quickly.
How it works: Builds a 7-bin simple moving average using prev() and flags bins where current > 2x baseline and above a minimum count.
Interpretation: Each resulting row is a spike timestamp. Drill into correlating queries, reasons, and callers for RCA.

Cluster-agnostic rolling baseline using prev().
let _multiplier = 2.0;  // spike sensitivity (current > 2x baseline)
let _minCount   = 5;    // ignore tiny blips
DataverseThrottle
| summarize Throttles = count() by timestamp = bin(timestamp, 1h)
| order by timestamp asc
| serialize
| extend p1 = prev(Throttles,1), p2 = prev(Throttles,2), p3 = prev(Throttles,3),
         p4 = prev(Throttles,4), p5 = prev(Throttles,5), p6 = prev(Throttles,6)
| where isnotnull(p6)
| extend MovingAvg = (todouble(Throttles)+todouble(p1)+todouble(p2)+todouble(p3)+todouble(p4)+todouble(p5)+todouble(p6)) / 7.0
| where Throttles > _multiplier * MovingAvg and Throttles > _minCount
| project timestamp, Throttles, MovingAvg
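
As an alternative to the hand-rolled moving average, Kusto's native series functions can score anomalies directly. A hedged sketch using make-series and series_decompose_anomalies (assumes the _lookback parameter from the helper block is in scope):
// Alternative: built-in anomaly detection over hourly throttle counts
DataverseThrottle
| make-series Throttles = count() default=0 on timestamp from ago(_lookback) to now() step 1h
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(Throttles, 2.0)
| render anomalychart with (anomalycolumns=Anomalies)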

6) Most-affected callers / operations

Purpose: Identify which users, operations, or sessions are most impacted for targeted fixes and communication.
How it works: Groups by the first non-null identifier among userId, operation name, and sessionId; computes counts and average delay.
Interpretation: High counts and large AvgDelayMs indicate prime candidates for optimization or scheduling changes.

DataverseThrottle
| summarize Throttles = count(), AvgDelayMs = avg(delayMs)
  by Caller = coalesce(userId, opName, sessionId)
| top 15 by Throttles desc

7) Recent detailed samples (triage)

Purpose: Provide a short, actionable list of the latest throttle events for on-call and RCA.
How it works: Projects key fields (reason, delay, expiry, command) and sorts by time descending.
Interpretation: Use the command snippet to trace the source; expiry helps estimate when pressure will ease.

DataverseThrottle
| project timestamp, reason, delayMs, probability, expiryTime,
          opId, opName, userId,
          Command = substring(command, 0, 800)
| order by timestamp desc
| take 100

8) Environment / org breakdown (optional, multi-tenant)

Purpose: Show where throttling is concentrated across environments or organization URLs for multi-tenant teams.
How it works: Groups by environment and organization, computing counts and average delay; returns the top 20.
Interpretation: Hotspots indicate localized workload or configuration issues; focus remediation there first.

DataverseThrottle
| summarize Throttles = count(), AvgDelayMs = avg(delayMs)
  by tostring(environmentId), tostring(organization)
| top 20 by Throttles desc

Build a dashboard / workbook

Follow the instructions and use the downloadable JSON file to install the dashboard: Dynamics-365-FastTrack-Implementation-Assets/MonitoringAndTelemetry/DataverseThrottling at master · microsoft/Dynamics-365-FastTrack-Implementation-Assets

How to act on the findings

  • Reduce query cost: fetch fewer columns, add filters, avoid broad joins, paginate.
  • Reschedule heavy jobs: move bulk loads/exports/reporting to off-peak hours.
  • Cache repeat reads: use app/service caching for hot paths.
  • Work the list: start with Top Hashes, confirm Reasons, validate user impact via Percentiles, and watch Trend/Spikes trend down after fixes.

Alerting (optional)

  • Sustained spikes: count() of throttles above a threshold per period.
  • Single noisy query: a specific cdsQueryHash crossing N events in M minutes (see the sketch after this list).
  • Rising severity: AvgDelayMs or P95 delay above target.
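
For example, the single-noisy-query condition could be expressed as a scheduled log-search query along these lines (thresholds are placeholders; include the helper in the same alert query):
// Alert sketch: any single query hash exceeding N throttle events in M minutes
let _window    = 15m;   // M minutes
let _threshold = 50;    // N events
DataverseThrottle
| where timestamp > ago(_window)
| summarize Throttles = count() by cdsQueryHash
| where Throttles > _threshold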

FAQ

1. Does this replace API throttling?
     No. Query throttling governs work we accept, while API throttling protects the service at the boundary (rate/concurrency). You need both layers for stability; this telemetry helps diagnose and tune queries once requests are inside.


2. Does the telemetry add overhead?
    It’s lightweight, similar to other App Insights traces. The benefit—faster time-to-root-cause—far outweighs the cost.


3. Will field names change?
     It’s a preview; minor name changes are possible. Use the helper function to normalize.

Closing

This preview closes one of the biggest black boxes in Dataverse performance. Turn it on, paste the queries, and give your teams the visibility to self-diagnose and fix throttling quickly.

