On Multi-Region, Multi-Active Databases: Q&A with Chad Tindel
Q1. Can we start off with some background on data replication and why it is growing in popularity?
A1. Users are implementing replication to achieve higher availability across multiple fault zones, or to set up failover configurations across geographically disparate locations. Also, as they run applications in a globally distributed fashion, they need data to be replicated globally for low-latency access.
Q2. What is a multi-Region, multi-active database?
A2. A Region at AWS is a physical location where we cluster data centers, and each Region consists of multiple, isolated, and physically separate groups of data centers called Availability Zones.
The multi-Region aspect of Amazon DynamoDB global tables means that your data is automatically replicated to two or more of these physical locations around the world and there’s no limit to the number of Regions that can participate in a global table configuration. Some customers replicate their data to four or more Regions so that they have copies of the data near their customers for low-latency access.
When we say that DynamoDB global tables is “multi-active” that means you can issue read, insert, update, and delete requests to all items (our term for “rows”) in the table from all Regions at any moment in time without having to wait for multiple Regions to replicate the write requests to achieve quorum. There are no “primary” and “secondary” Regions – all Regions are fully active all the time and DynamoDB uses a last writer wins conflict resolution mechanism if applications update the same Item in different Regions at the same time.
Q3. How do global tables in Amazon DynamoDB work?
A3. It’s important not to think of a “global table” as one entity. DynamoDB tables are always completely contained inside of a single Region with 3 copies made of every Item in 3 distinct Availability Zones in that Region. When you create a global table configuration what you are actually doing is creating multiple distinct DynamoDB tables in different Regions and establishing a replication relationship amongst all the participating Regions.
What is really happening behind the scenes is that you have separate tables in each of those Regions that operate independently from each other, and when you issue a write request in one Region that write request is handled using the typical ACID semantics within that Region. The global tables replication infrastructure in that Region is listening to the associated DynamoDB Stream for changes and it replicates those changes asynchronously to the other participating Regions. This is why the write cost in global tables is different from that of a regular Regional table; the price of a replicated write capacity unit (rWCU) or replicated write request unit (rWRU) accounts for the cost of the dedicated replication infrastructure that keeps the different Regions in sync with each other. Replication is offloaded to that infrastructure, so client requests and replication are processed by separate fleets.
Because this replication happens asynchronously after the transaction has been committed in a Region we say that global tables are eventually consistent. If there is a conflict because the application changed the same Item (row) in multiple Regions simultaneously, the DynamoDB global tables replication infrastructure will look at all the mutations on that item and discard all but the most recent mutation, hence the “last writer wins” method of conflict resolution.
Q4. How do you architect your application to work with it?
A4. There are two main architecture models for working with global tables. The first and arguably most common is for customers who need basic Disaster Recovery (DR) capabilities and run their application in an Active-Passive or Active-Standby model. Maybe their CIO has placed a requirement that all databases need to have a DR copy stored somewhere in a geographically separate location so that in the case of a massive power grid outage or natural disaster, they can spin up their applications in a DR site or alternate Region and be up and running within the organization’s Recovery Time Objective (RTO). These users are only reading and writing to a single Region of DynamoDB at any moment in time. In the event of an outage they will either manually spin up application instances in the DR Region or perhaps they have the application already running in that Region and they just update DNS (or use something like Route 53 health checks) to start routing traffic to the DR Region. The point is they are just using global tables because it makes it so easy to replicate their data to the DR Region, literally just a button click to start doing it even for an already existing application Table, but they aren’t utilizing the “multi-active” capabilities of DynamoDB.
The second application architecture is the true multi-Region, multi-active deployment model, where you have your application actively running in multiple Regions simultaneously and use something like Route 53 Geolocation/Geoproximity/Latency routing policies to route your customers’ requests to the most appropriate Region. With this architecture the application instances should always route their requests to the DynamoDB Table in the local Region, and if there is some sort of issue inside that Region, just stop routing traffic to the application instances in that Region entirely until the issue is sorted out.
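The routing rule in the multi-active model can be sketched in a few lines of Python. In a real deployment this decision is made by DNS routing policies and health checks, not by application code; the function names, the health map, and the fallback order below are hypothetical and exist only to show the rule: requests go to the nearest healthy Region, and the application in a Region only ever talks to the table replica in that same Region.

```python
def pick_app_region(nearest_region, healthy, fallback_order):
    """Choose which Region's application stack serves a client.

    healthy: dict mapping Region name -> bool (hypothetical health signal).
    fallback_order: Regions to try, in order, if the nearest one is unhealthy.
    """
    if healthy.get(nearest_region, False):
        return nearest_region
    for region in fallback_order:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy Region available")


def table_region_for(app_region):
    """Application instances always use the table replica in their own Region."""
    return app_region


health = {"us-east-1": True, "eu-west-1": False, "ap-southeast-1": True}
order = ["us-east-1", "ap-southeast-1", "eu-west-1"]

# A European client is normally served from eu-west-1, but that Region is
# currently marked unhealthy, so traffic moves to another Region entirely.
serving_region = pick_app_region("eu-west-1", health, order)
table_region = table_region_for(serving_region)
```

The point of the design is that an unhealthy Region is taken out of rotation as a whole, so the application never makes cross-Region calls to a remote table replica.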
Q5. What is the performance and availability of global tables?
A5. The availability SLA for a single Region of DynamoDB is 99.99% (“4-nines”) and when you add at least one more Region the availability SLA increases to 99.999% (“5-nines”). Because global tables replication operates on its own dedicated fleet and uses asynchronous replication there is no performance impact to application requests! Reads and writes within a Region still occur in that same predictable single-digit millisecond response time with global tables that you get with a single-Region Table.
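For a sense of scale, the two availability figures can be turned into a downtime budget. This is back-of-the-envelope arithmetic over a 365-day year, written as a short Python sketch; the actual SLA defines its own measurement period and terms, so treat the numbers as illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def max_downtime_minutes(availability_percent):
    """Minutes per year that may be unavailable at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


single_region = max_downtime_minutes(99.99)   # "4-nines"
global_table = max_downtime_minutes(99.999)   # "5-nines"

print(f"single Region: {single_region:.2f} minutes/year")
print(f"global table:  {global_table:.2f} minutes/year")
```

Under these assumptions the budget drops from roughly 52.6 minutes per year to roughly 5.3 minutes per year, a tenfold reduction.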
Q6. What makes DynamoDB unique amongst the AWS Database offerings?
A6. Our purpose-built portfolio of database engines supports a diverse set of data models and use cases. Some customers require multi-Region, multi-active capabilities and DynamoDB global tables is uniquely purpose-built to offer this to our customers. The multi-active capabilities of DynamoDB mean you will see that consistent single-digit millisecond response time for write requests in all Regions that are participating in the global table configuration.
Q7. Can you share some customer examples?
A7. As you can imagine global tables are used by the biggest banks, telcos, healthcare providers, governments, airlines, and social media companies today. And of course, Amazon itself is one of the largest users of DynamoDB and global tables where we use it for thousands of different microservices internally both on the Amazon.com retail side and as a tier-0 service powering other AWS services. See Jeff Barr’s blog post about the 2022 Prime Day where he says that Amazon sources alone peaked at 105.2 million requests per second to DynamoDB. That is an astounding number for a single customer’s database usage in a cloud database service.
One of the most recognizable public references is Disney, which uses global tables for several different microservices in its recently launched Disney+ service, like storing customers’ Watch Lists. That list has to be available quickly so it can populate the customer’s home screen every time they launch the application. Since people travel, Disney wants that data distributed globally for fast access for all customers. Another example is the Bookmark service that records where the customer is in a particular piece of content so they can pause/resume it later on the same or a different device. You can imagine how high the write throughput is for constantly updating the current timecode of all users that are currently watching media, and DynamoDB is a perfect fit for that high throughput use case.
Another great example that we’re all probably familiar with is Zoom. When the Covid-19 pandemic hit in March 2020 and the world had to pivot all at once to online meetings for work, school, and socializing with friends and family, Zoom received a big share of that increased demand for voice and video services. Their daily usage rapidly grew from 10 million daily meeting participants to 300 million daily meeting participants and they managed that surge in large part due to DynamoDB’s high-performance at any scale to keep track of all the active participants in all of their meetings. For Zoom, knowing which participants are in an active meeting is as mission-critical as a workload gets so in a sense they’ve really bet their business on DynamoDB and AWS.
Qx. Anything else you wish to add?
Ax. It’s so easy to get started using global tables! There is a built-in CloudFormation resource if you’re launching a new table and want it to be global from the start, or if you already have an existing table it’s literally just a matter of clicking a couple of buttons in the console to create replicas in other Regions. Give it a try and I think you’ll agree with me that it just feels like magic!
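As a rough illustration of the CloudFormation route, the Python sketch below builds a minimal template around the `AWS::DynamoDB::GlobalTable` resource type and prints it as JSON. The table name `WatchList`, the key attribute `pk`, the on-demand billing mode, and the two Regions are assumptions chosen for the example, and the property set shown is a minimal one as I understand it; check the CloudFormation resource reference before deploying anything based on it.

```python
import json


def global_table_template(table_name, regions):
    """Build a minimal CloudFormation template for a global table.

    table_name and regions are illustrative inputs; "pk" is a hypothetical
    partition key name.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "GlobalTable": {
                "Type": "AWS::DynamoDB::GlobalTable",
                "Properties": {
                    "TableName": table_name,
                    "BillingMode": "PAY_PER_REQUEST",
                    "AttributeDefinitions": [
                        {"AttributeName": "pk", "AttributeType": "S"}
                    ],
                    "KeySchema": [{"AttributeName": "pk", "KeyType": "HASH"}],
                    # Replication reads from the table's stream.
                    "StreamSpecification": {
                        "StreamViewType": "NEW_AND_OLD_IMAGES"
                    },
                    # One replica entry per participating Region.
                    "Replicas": [{"Region": region} for region in regions],
                },
            }
        },
    }


template = global_table_template("WatchList", ["us-east-1", "eu-west-1"])
print(json.dumps(template, indent=2))
```

Adding a Region later would then be a matter of appending another entry to `Replicas` and updating the stack.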
Chad Tindel is currently a Principal NoSQL Solution Architect at Amazon Web Services and host of the popular podcast “Can I get that software in blue?”. He holds a BS in Computer Science from Cal Poly San Luis Obispo as well as an MBA and MS Finance from the University of Denver. After 12 years of writing code professionally, Chad discovered that he really loved working with customers and has since worked at a slate of very successful companies as a Solution Architect: Red Hat, Cloudera, MongoDB, Prevoty, Elastic, and AWS.
Sponsored by AWS