Pulse x reMKTR

Battle-Test Failover with DynamoDB MRSC Global Tables and AWS FIS

Written by Jacob Heinz | Feb 2, 2026 8:49:27 PM

Bad day scenario: a Region blips, your cart service faceplants, customers rage-tap refresh. Most teams pray. You’ll do better—you’ll break stuff on purpose first.

Amazon DynamoDB global tables with multi-Region strong consistency (MRSC) plus AWS Fault Injection Service (FIS) let you simulate ugly failures—Regional outages, network partitions, latency spikes—and watch your app hold the line. The result: sub-second RPO/RTO and zero data loss by design, proven by chaos.

This is the upgrade from “we think it works” to “we proved it under fire.” You’ll script failures, track replication and write conflicts, and practice failover until it’s boring. Then ship.

“Everything fails, all the time.” That’s Werner Vogels. Your move is making failure uneventful.

So treat resiliency like security and performance: not a checkbox, a drill. Write down what “healthy” means, inject pain on purpose, and force your system to show how it behaves when the universe is rude. That’s how you turn scary unknowns into boring playbooks.

We’ll keep it practical. You’ll pair DynamoDB global tables with FIS, wire up CloudWatch, and run short game-day experiments that target the gnarly stuff—replication lag, conflicts, failover, hot keys. You’ll learn what breaks and how to fix it, fast. When a Region sneezes, your customers shouldn’t even sniff it.

Key Takeaways

  • Use FIS to inject Regional outages, AZ failures, and network latency against DynamoDB global tables (MRSC) without touching prod traffic.
  • Validate replication lag, conflict resolution, and failover paths; aim for sub-second RPO/RTO.
  • Choose single-writer or multi-writer; understand last-writer-wins and conditional writes.
  • Monitor CloudWatch metrics and DynamoDB Streams; document a repeatable game day runbook.
  • Costs are pay-per-use: DynamoDB charges per read, write, storage, and replicated write; FIS bills per action-minute of experiment time.

What MRSC and FIS Provide

MRSC in Plain English

Multi-Region strong consistency (MRSC) means your writes commit across Regions under a consistency contract designed to deliver zero data loss and sub-second recovery goals. With DynamoDB global tables (v2), you replicate data across Regions at petabyte scale with single-digit millisecond performance. The MRSC twist: you’re proving a Region going dark won’t drop writes or return stale reads where it matters. You’re designing for reality, not hope.

“In distributed systems, everything fails, all the time.” — Werner Vogels. MRSC turns that from a scary quote into an engineering constraint you can test.

MRSC is a consistency mode you choose for a global table, but treat it as a discipline too: you pick the items and flows that need correctness above all (carts, balances, inventory), then enforce invariants with conditional writes, idempotency, and conflict-aware logic. You pair that with near-real-time replication across Regions and drill until failover is muscle memory. The destination is boring reliability: your app behaves the same no matter which Region takes the write.

A practical mental model:

  • Critical paths get guardrails (ConditionExpression, version attributes, dedupe tokens) so timestamp-based conflict resolution never silently violates business rules; see the sketch after this list.
  • Observability tells you how quickly replicas converge and where conflicts cluster.
  • Game days harden your code paths and the people paths—alerts, runbooks, and on-call.
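
To make that first guardrail concrete, here is a minimal Python (boto3) sketch of a version-guarded update. The table name, key schema, and attribute names are hypothetical assumptions; the pattern is what matters: the write only lands if the item is still at the version you last read.

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical table, key schema, and attribute names.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
carts = dynamodb.Table("Carts")

def update_cart_line(cart_id: str, sku: str, qty: int, expected_version: int) -> bool:
    """Write only if the item is still at the version we last read."""
    try:
        carts.update_item(
            Key={"cartId": cart_id, "sku": sku},
            UpdateExpression="SET qty = :qty, #v = :next",
            ConditionExpression="#v = :expected",
            ExpressionAttributeNames={"#v": "version"},  # alias to dodge reserved words
            ExpressionAttributeValues={
                ":qty": qty,
                ":next": expected_version + 1,
                ":expected": expected_version,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # Lost the race: re-read, reconcile, and decide whether to retry.
            return False
        raise
```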

Why FIS Matters

AWS Fault Injection Service lets you run safe, controlled chaos: pause a Regional dependency, add latency between Regions, or disrupt availability zones. Target DynamoDB global tables and related infrastructure, then measure outcomes with CloudWatch. You can simulate:

  • Region-level outages and controlled failover switchovers
  • Network partitions that trigger replication backlogs
  • Latency spikes that punch your P99s

A real-world example: you run a 30-minute FIS experiment injecting 300 ms cross-Region latency while hammering a hot partition (e.g., inventory item). You observe replication health, application retries, and conditional write behavior. If carts still commit and reads stay fresh across Regions—you’re ready.

Pro tip: combine FIS events with load gen (e.g., distributed load test) to stress read/write paths while faults unfold.

Safety rails to use every time:

  • Run in a staging environment that mirrors prod traffic patterns before you touch prod.
  • Scope FIS actions tightly with IAM and resource tags; use experiment templates so nobody fat-fingers a Region-wide blast.
  • Time-box experiments (15–30 minutes), add CloudWatch alarms as stop conditions, and require approvals.
  • Annotate timelines so you can correlate metrics to fault windows.
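
Here is a rough boto3 sketch of an experiment template that bakes in those rails: ARN-scoped targets, a time-boxed action, and a CloudWatch alarm as a stop condition. The role, table, and alarm ARNs are placeholders, and the DynamoDB replication-pause action ID, resource type, and target key shown here are assumptions to verify against the current FIS supported-actions reference before you rely on them.

```python
import uuid
import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Placeholders: bring your own IAM role, table ARN, and guardrail alarm ARN.
FIS_ROLE_ARN = "arn:aws:iam::123456789012:role/fis-gameday-role"
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/Carts"
STOP_ALARM_ARN = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate"

response = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Game day: pause global table replication for 15 minutes",
    roleArn=FIS_ROLE_ARN,
    # Stop condition: the experiment halts automatically if this alarm fires.
    stopConditions=[{"source": "aws:cloudwatch:alarm", "value": STOP_ALARM_ARN}],
    targets={
        "cartsTable": {
            "resourceType": "aws:dynamodb:global-table",  # verify in the FIS docs
            "resourceArns": [TABLE_ARN],
            "selectionMode": "ALL",
        }
    },
    actions={
        "pauseReplication": {
            "actionId": "aws:dynamodb:global-table-pause-replication",  # verify action ID
            "parameters": {"duration": "PT15M"},  # time-boxed fault window
            "targets": {"Tables": "cartsTable"},  # target key is action-specific; check docs
        }
    },
    tags={"team": "platform", "purpose": "gameday"},
)
print(response["experimentTemplate"]["id"])
```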

Replication Design Choices

Choosing Your Write Topology

You’ve got two main patterns:

  • Single-Region writer, multi-Region readers: simpler mental model, fewer conflicts, straightforward failover.
  • Multi-Region writers (bidirectional): ultra-low write latency everywhere, but you must tame conflicts.

Both ride DynamoDB global tables. For e-commerce, single-writer is often enough; for gaming leaderboards or social features with global users, multi-writer can win on latency.

Add a quick decision hack:

  • Pick single-writer if most writes start in one geography, correctness is king, and you can tolerate a controlled writer failover.
  • Pick multi-writer if users write globally and P99 write latency is a business KPI. Budget extra effort for conflict detection, idempotency, and reconciliation.

Conflict Resolution Without Drama

DynamoDB global tables resolve concurrent updates with last-writer-wins based on timestamps. That’s predictable and fast, but you need idempotency and app-level checks where correctness beats speed. Use ConditionExpression to guard invariants (e.g., version numbers or balance deltas) so you fail fast rather than silently clobber.

  • Example: a fintech ledger uses conditional writes to enforce monotonic sequence numbers. If a conflict hits, you catch ConditionalCheckFailedException and retry with a deterministic rule.
  • DynamoDB Streams give you a replayable change log. Tap them for reconciliation jobs or alerts when write skew appears.

Add GSIs carefully: global secondary indexes replicate across Regions too, so write amplification grows. Hot keys? Use sharding patterns and adaptive capacity.
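
A minimal write-sharding sketch, assuming a hypothetical Inventory table keyed on pk: spread a hot item across N suffixed keys on write, then sum across shards (or maintain a rollup via a Streams consumer) on read.

```python
import random
import boto3

table = boto3.resource("dynamodb").Table("Inventory")  # hypothetical table keyed on pk
SHARD_COUNT = 10  # tune until no single shard runs hot

def write_sharded(item_id: str, delta: int) -> None:
    """Spread writes for one hot item across N suffixed keys."""
    shard = random.randrange(SHARD_COUNT)
    table.update_item(
        Key={"pk": f"{item_id}#{shard}"},
        UpdateExpression="ADD stockDelta :d",
        ExpressionAttributeValues={":d": delta},
    )

def read_total(item_id: str) -> int:
    """Readers sum across shards (or a Streams consumer maintains a rollup)."""
    total = 0
    for shard in range(SHARD_COUNT):
        item = table.get_item(Key={"pk": f"{item_id}#{shard}"}).get("Item", {})
        total += int(item.get("stockDelta", 0))
    return total
```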

Expert note from the AWS Builders Library: retries can amplify traffic during incidents; add jittered backoff and sensible timeouts so your recovery doesn’t DDoS yourself.

A few more hard-won tactics:

  • Store a client-side idempotency key (like orderId) and write it with a conditional expression (attribute_not_exists) to prevent duplicates during retries; see the sketch after this list.
  • Track a “version” attribute and require version = expected for updates. On mismatch, re-read and reconcile.
  • For counters or balances, prefer transactional patterns or write-invertible deltas you can replay or compensate.
  • Keep item size lean (<400 KB) and avoid high-churn attributes; smaller items replicate faster and cheaper.
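
Pulling the first two tactics together with the jittered-backoff advice above, here is a rough sketch of a write that is safe to retry. The Orders table, orderId key, and retry limits are illustrative assumptions.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

orders = boto3.resource("dynamodb").Table("Orders")  # hypothetical table keyed on orderId

def put_order_once(order: dict, max_attempts: int = 5) -> bool:
    """Write an order at most once; retries are safe because duplicates are rejected."""
    for attempt in range(max_attempts):
        try:
            orders.put_item(
                Item=order,
                # orderId doubles as the idempotency key: reject duplicates outright.
                ConditionExpression="attribute_not_exists(orderId)",
            )
            return True
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code == "ConditionalCheckFailedException":
                return True  # an earlier attempt already landed; treat as success
            if code in ("ProvisionedThroughputExceededException", "ThrottlingException"):
                # Full-jitter backoff so your own retries don't pile onto an incident.
                time.sleep(random.uniform(0, min(8.0, 0.1 * (2 ** attempt))))
                continue
            raise
    return False
```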

Run Chaos Experiments

Faults to Inject

Start with the high-signal, low-risk tests:

1) Regional outage simulation: disable a Region’s dependencies so your app must fail over reads/writes to a healthy Region.
2) Cross-Region network latency: inject 200–500 ms and observe replication queues, write latencies, and client timeouts.
3) AZ failure: take down an Availability Zone and ensure your provisioned capacity or On-Demand scaling stays steady.
4) Partial partition: block specific service endpoints to mimic asymmetric failures.

Netflix’s chaos mantra applies: “The best way to avoid failure is to fail constantly.” Do it safely.

Turn that into a repeatable template:

  • Baseline: capture 10 minutes of steady-state metrics (success rates, P95/P99, backlog = 0, minimal conflicts).
  • Kickoff: start the FIS experiment and immediately tag the timeline.
  • Escalate: if results are stable, bump latency or duration on the next run—not in the same run.
  • Rollback: define automatic stop conditions tied to error rates or SLO breaches.
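
For that rollback bullet, a CloudWatch alarm doubles as the FIS stop condition. A boto3 sketch, assuming the ReplicationLatency metric (reported per receiving Region for global tables) and a threshold you would tune to your own SLO; verify the metric and dimension names in your account before wiring it in.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Guardrail alarm: sustained replication lag to the secondary Region stops the drill.
cloudwatch.put_metric_alarm(
    AlarmName="carts-replication-lag-high",
    Namespace="AWS/DynamoDB",
    MetricName="ReplicationLatency",  # emitted per receiving Region for global tables
    Dimensions=[
        {"Name": "TableName", "Value": "Carts"},
        {"Name": "ReceivingRegion", "Value": "eu-west-1"},
    ],
    Statistic="Average",
    Period=60,               # one-minute datapoints
    EvaluationPeriods=3,     # roughly three minutes of sustained breach
    Threshold=5000,          # milliseconds; pick a value tied to your SLO
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```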

Metrics to Watch and Why

Wire CloudWatch alarms and dashboards before you push the red button:

  • Write success rate, system/user errors, throttled requests
  • ConditionalCheckFailedRequests counts (conflict pressure)
  • Tail latencies (P95/P99) for reads/writes by Region
  • Replication health and backlog trends (watch the ReplicationLatency metric, table/replica status, and Streams consumer lag)
  • Application SLOs: checkout success rate, time-to-consistency for hot items, leaderboard freshness

Example experiment: 20-minute latency injection between us-east-1 and eu-west-1 during peak-like load. Expected result: multi-writer keeps local writes snappy; conflicts spike slightly but remain bounded; replication backlog clears <60s post-test; customers see consistent balances and carts.

Automate teardown and add annotations so you can correlate faults to metrics later. If you can’t measure it, you didn’t test it.

Observability pointers:

  • Break down metrics by partition key class (hot vs normal) to spot skew.
  • Track DynamoDB Streams consumer lag; if lag climbs during a fault and doesn’t recover, your downstream jobs may be the bottleneck.
  • Add synthetic checks that perform a real read-after-write in both Regions for a known test item.
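
Picking up that last pointer, a minimal canary sketch: write a throwaway item in one Region and time how long it takes to become readable in the other. Table name, key schema, and Regions are assumptions.

```python
import time
import uuid

import boto3

# Hypothetical canary: same table name in both Regions, keyed on cartId.
writer = boto3.resource("dynamodb", region_name="us-east-1").Table("Carts")
reader = boto3.resource("dynamodb", region_name="eu-west-1").Table("Carts")

def time_to_consistency(timeout_s: float = 10.0) -> float:
    """Seconds until a fresh write in us-east-1 is readable in eu-west-1, or -1 on timeout."""
    key = {"cartId": f"canary#{uuid.uuid4()}"}
    writer.put_item(Item={**key, "writtenAt": int(time.time() * 1000)})
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if "Item" in reader.get_item(Key=key):  # add ConsistentRead=True to exercise MRSC reads
            return time.monotonic() - start
        time.sleep(0.05)
    return -1.0
```

Publish the result as a custom CloudWatch metric so it lands on the same game day dashboard as your fault annotations.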

Pocket Notes

  • Strong consistency across Regions is a design choice, not a wish. Test it.
  • Prefer single-writer for simple correctness; use multi-writer when latency is king.
  • Last-writer-wins is fine—until it isn’t. Add conditional writes for invariants.
  • FIS makes failure repeatable. Start with latency and Regional outage drills.
  • Watch conflicts, P99s, and backlog. Your SLOs don’t care about averages.

Limits and Costs

Real World Limitations

  • Partition keys and hot items: MRSC doesn’t save you from poor key design. Distribute writes.
  • TTL behavior: an expiration fires in one replica and the resulting delete then replicates; design cleanup jobs that don’t assume cross-Region TTL symmetry.
  • Item size and GSI write amplification still apply globally. Every replica pays.
  • Schema drift across Regions? Don’t. Keep table definitions, capacity modes, and GSIs aligned.
  • Operational blast radius: guardrails in IAM and change control—misconfig in one Region propagates fast.

For "dynamodb global tables limitations," check official docs to validate specifics in your stack.

Add a few more sharp edges to respect:

  • Per-partition throughput still matters; hot partitions throttle regardless of global footprint. Use good keys and, if needed, write sharding.
  • Streams ordering is guaranteed per partition key. Don’t assume global order across keys; design consumers accordingly.
  • Quotas apply per account/Region (throughput, tables, Streams consumers). Know them before game day.
  • Backfilling a new replica Region can take time depending on data size. Plan migrations outside peak and monitor status.

Price and Savings

DynamoDB global table pricing boils down to:

  • You pay for replicated writes in each Region, reads, storage, Streams, and cross-Region data transfer.
  • FIS is billed per action-minute; keep tests sharp and time-boxed.
  • On-Demand vs provisioned: On-Demand is great for unpredictability; provisioned + auto scaling can trim steady-state spend.
  • Reduce write amp: minimize attribute churn, compress heavy payloads client-side, and avoid redundant GSIs.

Example cost control: if your multi-Region game leaderboard writes 10x/sec per shard, consider local aggregation with periodic cross-Region consolidation. Fewer, larger writes can be cheaper than a firehose of small ones.
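
A rough sketch of that local-aggregation idea, assuming a hypothetical Leaderboard table keyed on playerId: buffer score deltas in memory and flush one consolidated ADD per player per interval.

```python
import threading
import time
from collections import defaultdict

import boto3

table = boto3.resource("dynamodb").Table("Leaderboard")  # hypothetical, keyed on playerId

class ScoreBuffer:
    """Accumulate score deltas locally; flush one consolidated ADD per player per interval."""

    def __init__(self, flush_interval_s: float = 5.0):
        self._pending = defaultdict(int)
        self._lock = threading.Lock()
        self._interval = flush_interval_s
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def add(self, player_id: str, delta: int) -> None:
        with self._lock:
            self._pending[player_id] += delta

    def _flush_loop(self) -> None:
        while True:
            time.sleep(self._interval)
            with self._lock:
                batch, self._pending = self._pending, defaultdict(int)
            for player_id, delta in batch.items():
                # One larger write replaces a burst of tiny ones (and their replication cost).
                table.update_item(
                    Key={"playerId": player_id},
                    UpdateExpression="ADD #s :d",
                    ExpressionAttributeNames={"#s": "score"},
                    ExpressionAttributeValues={":d": delta},
                )
```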

More levers worth pulling:

  • Cache read-heavy views at the edge (CloudFront, regional caches) to shave cross-Region reads.
  • Batch low-priority writes or reconcile in bulk via Streams consumers during off-peak.
  • Drop unused attributes and indexes; every byte and index update multiplies across Regions.

Game Day Playbook

Before the Blast

  • Define steady state: what does “healthy” mean for carts, ledgers, or leaderboards? Set SLOs.
  • Pick scenarios: outage, latency, partition. Write hypotheses and expected outcomes.
  • Create FIS templates targeting DynamoDB global tables and relevant infra. Lock IAM down.
  • Pre-wire dashboards, alarms, and on-call responders. Announce your game day.

Add acceptance criteria you can check in minutes:

  • RPO ~ 0 for critical writes (no lost confirmed orders).
  • RTO under your business target (e.g., write failover < 60 seconds, read failover < 500 ms).
  • Replication backlog drains to baseline within N minutes after the fault.
  • Conflict rate stays below a defined threshold and all conflicts are auto-resolved or reconciled.

During and After

  • Start small: short-duration latency injection at off-peak.
  • Observe in real time: replication health, conflicts, client error rates, and P99 latency.
  • Enforce timeouts, retries with jitter, and circuit breakers (minimal breaker sketch after this list). Verify graceful degradation paths.
  • End experiment, snapshot metrics, and run a blameless postmortem.
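
A minimal circuit-breaker sketch to go with the bullet above; the thresholds and the degraded-path helpers in the usage comment are illustrative assumptions, not a library API.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe once the cooldown elapses."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.failure_threshold:
            return True  # closed: traffic flows normally
        return (time.monotonic() - self.opened_at) >= self.cooldown_s  # half-open probe

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# Sketch of usage around a cross-Region call (call_secondary and degraded_response
# are hypothetical helpers you would supply):
#
#   if breaker.allow():
#       try:
#           result = call_secondary()
#           breaker.record_success()
#       except Exception:
#           breaker.record_failure()
#           result = degraded_response()
#   else:
#       result = degraded_response()
```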

First-hand style example: an e-commerce team runs a 15-minute Regional outage sim. Expected: carts write to the secondary Region within 1s; no lost orders; read paths switch under 500 ms. Actual: a background job flooded retries, spiking P99s. Fix: add jitter and bounded retries. Retest. Green.

Put this on a quarterly cadence. If you don’t practice failover, you don’t have failover.

Common postmortem actions that move the needle:

  • Tighten timeouts to match real P99s, not wishful thinking.
  • Add circuit breakers around least-reliable dependencies.
  • Reduce batch sizes in hot paths to cut item churn and write amp.
  • Improve alarms to fire on user-impacting symptoms (checkout success) not just low-level noise.

FAQ

Practical MRSC FIS example

Run a 20-minute cross-Region latency injection (e.g., 300 ms) during controlled load. Verify write success, conflict counts, replication backlog clearance under 60 seconds, and user SLOs (checkout success, balance accuracy). Then simulate a Regional outage to validate automated failover.

Conflict resolution in multi-writer

DynamoDB global tables use a last-writer-wins approach based on timestamps. To protect invariants (balances, idempotent operations), use conditional writes and application-side reconciliation. Track ConditionalCheckFailedException errors and handle retries with jittered backoff.

Key global tables limitations

Plan for write amplification across Regions, TTL expirations being local, and the need for consistent table/index definitions in every Region. Hot partitions remain a risk; design keys to distribute load. Review service quotas before game day.

Streams for MRSC testing

You don’t need to consume Streams yourself for replication to work, but they’re extremely useful for observability, reconciliation, and building integrity checks. Use Streams to validate ordering assumptions and to trigger compensating logic after conflict spikes.

Practical costs

DynamoDB charges for reads/writes/storage in every Region, plus replicated writes and data transfer. FIS charges per action-minute. Start with short, targeted drills; prefer off-peak windows and keep experiments measurable to control spend.

Single writer with global reads

Yes. Single-Region writer with multi-Region readers is common. You’ll fail over the writer on Regional issues and keep read replicas hot globally. It’s simpler, cheaper, and avoids most conflicts.

Chaos without production

Yes—start in a pre-prod environment that mirrors production traffic patterns and scale. Once you can predict outcomes, promote to carefully scoped, off-peak prod experiments with strict stop conditions and approvals.

Schema migrations and globals

Keep schema, capacity mode, and GSI definitions identical in every Region. Roll out changes gradually and monitor replication and write errors; mismatched definitions cause surprises. Plan index backfills off-peak.

Run an MRSC Drill

  • Define SLOs and steady state for one mission-critical flow.
  • Choose a scenario: Regional outage or 300 ms cross-Region latency.
  • Create an FIS template targeting your DynamoDB global table and dependencies.
  • Set CloudWatch dashboards and alarms for latency, conflicts, and replication health.
  • Run a 15–30 minute drill with realistic load. Annotate the timeline.
  • Capture metrics, fix bottlenecks (timeouts/backoff, hot keys), and schedule the next run.

Wrap-up thought: you don’t rise to the occasion—you fall to your runbook. Make it good, then make it muscle memory.

In the end, resiliency isn’t a feature; it’s a habit. MRSC with DynamoDB global tables gives you the architecture. FIS gives you the reps. Your job is to run the play until failover feels boring. When a Region sneezes, your customers shouldn’t notice. That’s the bar.

References

  • DynamoDB Global Tables (v2) overview: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html
  • DynamoDB Streams: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
  • Monitoring DynamoDB with CloudWatch: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/monitoring-cloudwatch.html
  • DynamoDB pricing (including global tables): https://aws.amazon.com/dynamodb/pricing/
  • AWS Fault Injection Service (FIS): https://aws.amazon.com/fis/
  • FIS pricing: https://aws.amazon.com/fis/pricing/
  • Supported services in FIS: https://docs.aws.amazon.com/fis/latest/userguide/fis-supported-services.html
  • AWS Builders Library — Timeouts, retries, and backoff with jitter: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
  • Netflix Tech Blog — The Simian Army: https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
  • Werner Vogels on failure in distributed systems: https://www.allthingsdistributed.com/2016/11/10-lessons-from-10-years-of-aws.html
  • DynamoDB Time to Live (TTL): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html
  • DynamoDB adaptive capacity: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AdaptiveCapacity.html
  • DynamoDB service quotas and item size limits: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html
  • Global secondary indexes: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalSecondaryIndexes.html
  • FIS experiment templates: https://docs.aws.amazon.com/fis/latest/userguide/experiment-templates.html
  • AWS Well-Architected Framework — Reliability Pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html