Your database is fine… until it isn’t. One slow query. One connection leak. Suddenly your pager is doing CrossFit. The scramble begins. Where’s the bottleneck? Who owns the fix? How fast can you push it?
Good news: CloudWatch Database Insights just landed in four new regions. Asia Pacific (New Zealand), Asia Pacific (Taipei), Asia Pacific (Thailand), and Mexico (Central). That brings total coverage to 25+ regions. It uses ML to flag anomalies and bottlenecks, then offers clear fixes. Think indexing a high‑cardinality column when scans go wild. The kicker: no agents to install.
If you run Amazon RDS for PostgreSQL or MySQL, or Amazon Aurora, this is for you. In this guide, you’ll turn noisy performance mysteries into fast, repeatable fixes. You’ll also see how Database Insights plays nicely with RDS Performance Insights, Calls/sec metrics, and the rest of your CloudWatch stack.
That’s the headline. The real win is speed. Fewer guessing games, faster clues, and fixes you can trust. We’ll keep it practical. How to enable it, connect the dots, and roll it out without waking your whole on‑call rotation.
CloudWatch Database Insights now spans 25+ regions. Fresh support in Asia Pacific (New Zealand), Asia Pacific (Taipei), Asia Pacific (Thailand), and Mexico (Central). The model is simple. It leans on CloudWatch and service‑integrated telemetry. So you avoid classic agent tradeoffs like overhead, version drift, and host access.
The promise is simple. Get on‑demand performance analysis and actionable guidance without redeploying your fleet. If teams are spread across time zones, first‑class support in these regions helps. You get fewer cross‑region hops and faster signal when things go sideways.
Why this matters beyond a map. More regions usually mean lower monitoring delay. And better alignment with how your teams and data live today. If you run multi‑account, multi‑region, you can set up consistent monitoring where workloads run. Not backhaul everything to one hub and hope.
No‑agent setups are great for managed databases. There’s nothing to install. Nothing to patch. Nothing that risks adding CPU or memory pressure to your database instances. You get insights and guidance while keeping instance resources focused on traffic.
Pro tip: strict separation helps here too. If you split prod vs. non‑prod, local‑region coverage makes blast‑radius control easier. Enable features region by region. Test in staging close to prod. Then roll out broadly with fewer surprises.
Here’s the usual midnight fire drill. Latency spikes. Dashboards look fine‑ish. Everyone guesses. With Database Insights, an ML baseline flags an anomaly on a write‑heavy table. It links it to query patterns that recently changed. You get a suggestion to add an index on a high‑cardinality column driving full scans. Instead of sifting through logs for an hour, you get a strong guess in under five minutes.
The result is a tighter incident‑response loop: detection, context, recommendation. You still own the fix, but now you have a map.
Picture the flow during a real incident: 1) the ML baseline flags a deviation on a key metric, 2) the anomaly is tied to the query patterns that recently changed, 3) you get a concrete suggestion, like an index on the hot column, 4) you validate with EXPLAIN and ship the fix.
This cuts the loop from “weird graph, let’s guess” to “here’s the likely root cause, let’s validate.” It’s a big upgrade in calm and speed.
Static thresholds age like milk. Database Insights builds behavioral baselines. It learns normal patterns for your workload and flags deviations. Bursts in query latency, lock waits, or connection growth that outruns app traffic. Because the baseline adapts, you avoid alert fatigue and still catch the weird stuff. Like a weekend cron job that suddenly explodes CPU.
It’s great for seasonal or launch‑driven workloads where normal shifts often. Think end‑of‑month finance runs or product drops. Those can turn a tidy read‑write ratio into a write‑heavy stampede.
Adaptive baselines shine when traffic is spiky or cyclical. They learn Monday mornings are busy. Saturday nights are quiet. So you don’t get paged for normal patterns. When an outlier pops, you see it. Retries blow up Calls/sec, locks rise faster than throughput, or connection count jumps suddenly. That’s where the anomaly shows up.
You can make these baselines work harder by pairing them with tags and SLOs. Scope alerts to the apps that matter. Then compare anomalies against your SLO budgets, like error or latency budgets. That way you’re not chasing noise.
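Database Insights builds these baselines for you, but the same idea is easy to add as an extra guardrail on a specific metric with CloudWatch’s generic anomaly detection. Here’s a minimal boto3 sketch, assuming a hypothetical instance named orders-prod and an existing SNS on‑call topic; treat the names and thresholds as placeholders, not the feature’s own API.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical identifiers -- swap in your own instance and SNS topic.
DB_INSTANCE = "orders-prod"
ONCALL_TOPIC = "arn:aws:sns:us-east-1:123456789012:db-oncall"

# Train a CloudWatch anomaly detection model on write latency for this instance.
cloudwatch.put_anomaly_detector(
    SingleMetricAnomalyDetector={
        "Namespace": "AWS/RDS",
        "MetricName": "WriteLatency",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE}],
        "Stat": "Average",
    }
)

# Alarm when the metric escapes the learned band (2 standard deviations wide).
cloudwatch.put_metric_alarm(
    AlarmName=f"{DB_INSTANCE}-write-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    DatapointsToAlarm=3,          # require a sustained deviation, not a blip
    ThresholdMetricId="band",
    TreatMissingData="notBreaching",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "WriteLatency",
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE}
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
    AlarmActions=[ONCALL_TOPIC],
)
```

The DatapointsToAlarm setting does the “sustained deviation” work, so a one‑minute blip doesn’t page anyone.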
Anomaly detection is the first step. Remediation guidance is the real unlock. When a bottleneck links to missing indexes on high‑cardinality columns, you get a clear nudge. When connections spike, you get a heads‑up to audit pooling and timeouts. If slow queries cluster by pattern, you can pick the highest‑impact fix first.
Use this along with engine‑native tools to validate next steps. For PostgreSQL, EXPLAIN and EXPLAIN ANALYZE confirm whether the new index changes plan shape and cost. For MySQL, EXPLAIN FORMAT=JSON is your friend. Database Insights speeds triage. Your database engine confirms the execution path.
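To make the validation habit concrete, here’s a minimal before/after plan check from Python, assuming psycopg2, a hypothetical orders table, and a candidate index on customer_id. It’s a sketch of the workflow, not a prescription for your schema.

```python
import psycopg2

# Hypothetical connection details and query -- adjust for your environment.
conn = psycopg2.connect("dbname=orders host=orders-prod.example.com user=app")
query = "SELECT * FROM orders WHERE customer_id = %s AND status = 'open'"

def show_plan(cur, sql, params):
    """Print the plan (with runtime stats) for a parameterized query.
    Note: EXPLAIN ANALYZE executes the statement, so keep it to reads."""
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + sql, params)
    for (line,) in cur.fetchall():
        print(line)

with conn, conn.cursor() as cur:
    # Capture the "before" plan that Database Insights flagged as a full scan.
    show_plan(cur, query, (42,))

    # Candidate fix from the recommendation: index the high-cardinality column.
    # Run DDL like this in a change window; CONCURRENTLY can't run inside a
    # transaction, so execute it outside this block.
    # CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

    # Re-check: the plan should move from Seq Scan to Index Scan at lower cost.
    show_plan(cur, query, (42,))
```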
Pro move: wire anomaly notifications into your on‑call system using CloudWatch alarms and EventBridge. Do not learn about an issue from your users.
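One hedged way to do that wiring with boto3: an EventBridge rule that catches alarm state changes and forwards them to an SNS topic your paging tool already subscribes to. The alarm‑name prefix and topic ARN below are assumptions.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical SNS topic already subscribed by PagerDuty, Opsgenie, or Slack.
ONCALL_TOPIC = "arn:aws:sns:us-east-1:123456789012:db-oncall"

# Match CloudWatch alarm state changes for this app's alarms going into ALARM.
events.put_rule(
    Name="db-alarm-to-oncall",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "state": {"value": ["ALARM"]},
            "alarmName": [{"prefix": "orders-prod-"}],  # scope to one app
        },
    }),
    State="ENABLED",
)

# Fan the matched events out to the on-call topic.
events.put_targets(
    Rule="db-alarm-to-oncall",
    Targets=[{"Id": "oncall-sns", "Arn": ONCALL_TOPIC}],
)
```

The SNS topic also needs a resource policy that lets EventBridge publish to it; most IaC templates for this pattern include one.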
A practical triage sequence you can reuse:
1) Check service health: is there an app deploy, config change, or spike in traffic? Annotate dashboards with deploy events from Systems Manager or your CI so you connect dots fast.
2) Check RDS Performance Insights: look at Calls/sec and top waits to understand the load story (there’s a small API sketch after this list).
3) Review the Database Insights anomaly: note what changed and the recommended next move.
4) Validate in‑engine: run EXPLAIN/EXPLAIN ANALYZE (PostgreSQL) or EXPLAIN FORMAT=JSON (MySQL) to verify plan and cost.
5) Fix and verify: apply the least risky change first, like index, query hint, or batch size. Then re‑check the plan and metrics.
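For step 2, you don’t have to stay in the console. A small sketch with the Performance Insights API (boto3’s pi client) pulls the last hour of database load split by wait event; the DbiResourceId is a placeholder you’d look up with describe-db-instances.

```python
from datetime import datetime, timedelta, timezone
import boto3

pi = boto3.client("pi")

# Performance Insights uses the DbiResourceId (starts with "db-"), not the
# instance name. Hypothetical value here.
RESOURCE_ID = "db-ABCDEFGHIJKL1234567890"

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Average active sessions for the last hour, split by top wait events.
resp = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier=RESOURCE_ID,
    StartTime=start,
    EndTime=end,
    PeriodInSeconds=60,
    MetricQueries=[
        {
            "Metric": "db.load.avg",
            "GroupBy": {"Group": "db.wait_event", "Limit": 5},
        }
    ],
)

for metric in resp["MetricList"]:
    dims = metric["Key"].get("Dimensions", {})
    label = dims.get("db.wait_event.name", "total")
    peak = max(
        (p.get("Value") or 0.0 for p in metric["DataPoints"]),
        default=0.0,
    )
    print(f"{label}: peak load {peak:.2f} average active sessions")
```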
You might be wondering about the “aws performance insights deprecated” rumors. Short answer: Performance Insights is not deprecated. It remains a core feature of RDS and Aurora. It shows database load, top SQL, wait events, and more. Database Insights complements it. It detects anomalies with ML and layers on targeted remediation guidance.
If you search “aws cloudwatch performance insights,” here’s the gist: Performance Insights shows the load, top SQL, and wait events for a specific instance; Database Insights layers ML anomaly detection and remediation guidance on top.
You use both to move from what happened, to why it happened, to what to do next.
A simple playbook: observe the anomaly Database Insights raises, orient with Performance Insights and Calls/sec, decide on the least risky fix, then act and verify the plan and metrics.
This is the “observe, orient, decide, act” loop for databases. Fast and repeatable.
Calls/sec in Performance Insights is a must‑watch metric. Spikes there without matching throughput or business traffic often point to retry storms. Or inefficient batching. Or hot partitions. Tie Calls/sec to lock waits and buffer cache misses. You’ll spot saturation early.
For execution‑plan work, keep your validations inside the database. Use EXPLAIN to compare before and after changes. Database Insights helps you pinpoint query candidates. The execution plan confirms impact.
And do not sleep on CloudWatch Container Insights if your database sits behind microservices. App‑level metrics like p95 latency, container CPU throttling, and pod restarts help. They can explain database‑side symptoms that look mysterious in isolation.
Other helpful signals to keep on a single pane: DatabaseConnections, CPUUtilization, read and write latency, FreeableMemory, lock waits, and buffer cache hit ratio.
Cross‑plot these with Calls/sec. You’ll see if your issue is volume, contention, or inefficiency.
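If you want those cross‑plots scripted rather than hand‑built, here’s a rough GetMetricData sketch over standard AWS/RDS metrics for a hypothetical orders-prod instance. Feed the series into whatever plotting or dashboard tool you already use.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
DB_INSTANCE = "orders-prod"  # hypothetical instance identifier

def rds_query(metric_id, metric_name, stat="Average"):
    """Build one GetMetricData query for a standard AWS/RDS metric."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE}
                ],
            },
            "Period": 300,
            "Stat": stat,
        },
    }

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_data(
    StartTime=end - timedelta(hours=6),
    EndTime=end,
    MetricDataQueries=[
        rds_query("conns", "DatabaseConnections"),
        rds_query("cpu", "CPUUtilization"),
        rds_query("read_lat", "ReadLatency"),
        rds_query("write_lat", "WriteLatency"),
        rds_query("mem", "FreeableMemory", stat="Minimum"),
    ],
)

# Print the latest value of each series so you can eyeball correlation quickly.
for series in resp["MetricDataResults"]:
    latest = series["Values"][0] if series["Values"] else None
    print(series["Id"], latest)
```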
Roll out Database Insights by environment. Dev, then staging, then a single prod slice. Tag databases by environment, owner, application, and cost center. That lets you scope alerts, dashboards, and budgets. So you don’t spam every team with every anomaly.
Create a runbook for the top three issues you expect to see. Missing indexes, connection pool exhaustion, and lock contention. When an anomaly trips, your on‑call has a step‑by‑step path to resolution.
A clean rollout checklist: enable in dev and confirm signal quality, promote to a staging environment that mirrors prod, pick one prod slice, tag every instance, wire alarms to on‑call, and write the runbooks before the first page.
For tags, standardize keys like Environment, App, Owner, SLO, and CostCenter. Use them to filter dashboards and notifications. This keeps signal tight and costs visible.
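Tagging is the boring part everyone skips, so script it. A small boto3 sketch with placeholder ARN and values; the tag keys mirror the standard above.

```python
import boto3

rds = boto3.client("rds")

# Hypothetical ARN and tag values -- standardize the keys across the fleet.
DB_ARN = "arn:aws:rds:us-east-1:123456789012:db:orders-prod"

rds.add_tags_to_resource(
    ResourceName=DB_ARN,
    Tags=[
        {"Key": "Environment", "Value": "prod"},
        {"Key": "App", "Value": "orders"},
        {"Key": "Owner", "Value": "payments-team"},
        {"Key": "SLO", "Value": "p95-read-200ms"},
        {"Key": "CostCenter", "Value": "cc-1234"},
    ],
)

# Later, scope dashboards and alerts by listing instances that carry the tags.
for db in rds.describe_db_instances()["DBInstances"]:
    tags = {t["Key"]: t["Value"] for t in db.get("TagList", [])}
    if tags.get("App") == "orders" and tags.get("Environment") == "prod":
        print(db["DBInstanceIdentifier"], tags.get("Owner"))
```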
Even if there isn’t a dedicated CloudFormation resource type for Database Insights, you can still codify the plumbing. CloudWatch alarms, dashboards, metric math, EventBridge rules, and IAM roles. Keep it all in Git. Review it. Ship consistent monitoring with your apps.
For incident routing, send anomaly alarms to PagerDuty, Opsgenie, or Slack via SNS. For change tracking, annotate dashboards with deployment events from Systems Manager or your CI. Then you can correlate fixes with trend shifts.
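Dashboards and deploy markers can be code too. Here’s a hedged put_dashboard sketch that plots connections for a hypothetical instance and drops a vertical annotation at a deploy timestamp your CI would supply.

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# One metric widget with a vertical annotation marking a deploy time, so
# trend shifts can be read against the change that caused them.
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "orders-prod connections vs. deploys",
                "region": "us-east-1",
                "metrics": [
                    ["AWS/RDS", "DatabaseConnections",
                     "DBInstanceIdentifier", "orders-prod"]
                ],
                "period": 300,
                "stat": "Average",
                "annotations": {
                    "vertical": [
                        # Hypothetical deploy marker; your CI would update this.
                        {"label": "orders v2.14 deploy",
                         "value": "2025-06-01T14:05:00Z"}
                    ]
                },
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="orders-prod-db",
    DashboardBody=json.dumps(dashboard_body),
)
```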
Finally, train the team. A 45‑minute brown‑bag demo goes a long way. Walk through an example anomaly, the related Performance Insights Calls/sec trace, and an EXPLAIN plan. It will pay for itself on your next Friday night page.
Guardrails to add early: budget alerts on monitoring spend, alarms that only page for sustained anomalies, and separate routing for prod vs. non‑prod so staging noise never wakes anyone.
Add these two memory hooks: Performance Insights tells you what the workload is doing; Database Insights tells you what changed and what to try next.
Database Insights is an ML‑powered, on‑demand database performance analysis feature in Amazon CloudWatch. It detects anomalies and suggests likely fixes, with no agents. RDS Performance Insights is a feature of RDS and Aurora that shows database load, top SQL, and waits. Use Performance Insights to understand workload shape. Use Database Insights to spot abnormal changes and get guided remediation.
Per the latest expansion, supported engines include Amazon RDS for PostgreSQL, RDS for MySQL, and Amazon Aurora. Availability is now in 25+ regions including Asia Pacific (New Zealand), Asia Pacific (Taipei), Asia Pacific (Thailand), and Mexico (Central). Always verify engine and region coverage on the AWS Regional Services list before rollout to your environment.
To repeat: AWS Performance Insights is not deprecated. It continues to be supported for Amazon RDS and Amazon Aurora. It remains a primary way to visualize load, waits, and top SQL. Database Insights complements, not replaces, Performance Insights.
Use Database Insights to find and prioritize the slow or anomalous queries. For execution plans, stay in the engine. Run EXPLAIN or EXPLAIN ANALYZE in PostgreSQL. Run EXPLAIN or EXPLAIN FORMAT=JSON in MySQL. Treat Database Insights as the triage and prioritization layer; the engine owns the execution plan.
Pricing for CloudWatch features varies by metric volumes, analytics, and logs. Review the Amazon CloudWatch pricing page for the most current details. Set budgets plus anomaly filters so costs track with value. As always, test in a lower environment first.
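Budgets can be codified as well. A rough sketch with the AWS Budgets API; the amount, e‑mail address, and the "AmazonCloudWatch" service filter value are assumptions to check against what Cost Explorer shows for your account.

```python
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# A monthly cost budget scoped to CloudWatch, with an 80% forecast alert.
# The amount, address, and service filter value are placeholders; confirm the
# exact service name in Cost Explorer before relying on the filter.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "cloudwatch-monitoring-spend",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "CostFilters": {"Service": ["AmazonCloudWatch"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "db-oncall@example.com"}
            ],
        }
    ],
)
```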
Even if there isn’t a first‑class CloudWatch Database Insights resource, you can manage the surrounding pieces with AWS CloudFormation. CloudWatch alarms, dashboards, metric filters, SNS topics for notifications, and IAM roles. Many teams wire EventBridge rules and runbooks into stacks to keep monitoring consistent.
Start with a learning window of 14–30 days so the ML learns “normal.” Then map anomaly types to your SLOs. For example, alert on latency anomalies that persist beyond a few minutes. But only page for anomalies that coincide with error spikes or Calls/sec surges.
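The “only page when anomalies coincide with error spikes” rule maps neatly onto a composite alarm. A minimal sketch, assuming the latency‑anomaly alarm from earlier and a hypothetical app‑side 5xx alarm already exist.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
ONCALL_TOPIC = "arn:aws:sns:us-east-1:123456789012:db-oncall"

# Page only when the latency anomaly AND an app-level error alarm fire together.
# Both child alarm names are assumptions from earlier sketches / your app stack.
cloudwatch.put_composite_alarm(
    AlarmName="orders-prod-latency-anomaly-with-errors",
    AlarmRule=(
        'ALARM("orders-prod-write-latency-anomaly") '
        'AND ALARM("orders-api-5xx-rate-high")'
    ),
    ActionsEnabled=True,
    AlarmActions=[ONCALL_TOPIC],
)
```

Keep the child alarms notification‑free and let the composite do the paging; that way each signal still shows up on dashboards without doubling your alerts.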
Three essentials: consistent tags so alerts route to the right owner, alarms wired into your on‑call tooling, and runbooks for the top three issues you expect to see.
Add a dry run: trigger a known slow query in staging and confirm the anomaly, the notification, and the runbook all fire the way you expect before prod depends on them.
Add “day two” tasks: review alert noise weekly, tune thresholds against your SLOs, prune alarms nobody acts on, and check monitoring spend against your budget.
Your job is not to stare at dashboards. It’s to protect user experience and move fast on fixes. CloudWatch Database Insights gives you that first mile of clarity. ML to spot what changed, plus practical guidance to get you moving. Paired with RDS Performance Insights and a crisp EXPLAIN habit, you go from vague symptoms to targeted remediation in minutes, not hours.
Roll it out like you would any powerful tool. Start small, tag consistently, automate the edges, and document what works. The goal is a repeatable, boring process. Because boring is how you win Friday nights back from the pager.
In ops, luck is not a strategy. ML is a system. Choose the system.