Why Nextdoor Migrated to Valkey: A Q&A with Slava Markeyev and Meet Bhagdev

Q1. Could you start by giving us an overview of Nextdoor’s technical architecture and the specific use cases where you rely on in-memory data stores? What scale and performance requirements does your platform demand, and what role does caching and real-time data access play in delivering the neighborhood-focused experience that millions of users depend on daily?

[Slava]
Nextdoor's core product experience for neighbors, government agencies, and local businesses is powered by a Django-based monolith and is supplemented by a small set of purpose-built microservices and a limited set of datastores.


I would describe the guiding principle behind our infrastructure decisions this way: by standardizing the tools and services we use, we reduce complexity, increase developer velocity, and better reap the benefits of core-infrastructure improvements.


To that end, we use Amazon Aurora PostgreSQL for our relational database needs, Amazon OpenSearch for search functionality, and Amazon ElastiCache for Valkey for in-memory key-value storage.
As time has progressed, the functionality overlap between these datastores has increased; for instance, each now supports vector search. Despite that overlap, it has remained important for us to stay well aligned with the core investment areas of each datastore, to ensure it fits its place in our infrastructure and our specific use cases.


While we have a few interesting tie-ins with Valkey in our infrastructure, such as a time-bound, eventually consistent database cache, our use cases across the board remain somewhat boring (caching, vector storage, rate limiting, deduplication, etc.). In reality, many datastores can support our key-value access patterns; the differentiating factors for us are horizontal scalability, high availability, and cost-efficiency.
For instance, it doesn't matter if a datastore can produce a single-digit-millisecond result for a KV query if Nextdoor becomes inaccessible to neighbors every time a planned event (an upgrade or scaling activity) or unplanned event (a hardware failure) occurs.

Q2. The decision to upgrade from Redis to Valkey represents a significant architectural choice. Can you walk us through the evaluation process that led Nextdoor to choose Valkey? What specific factors—whether technical, operational, licensing-related, or community-driven—made Valkey the right fit for your needs, and how did you assess the risks and benefits of making this transition?

[Slava]
For us, it's important that a chosen datastore's core competencies, the direction of its development, and our cloud provider are all well aligned. The decision to migrate from Redis to Valkey was really a no-brainer for us.


What gave us confidence when switching was threefold:

  1. The core contributors who formed Valkey came from a diverse set of companies, and the project was supported by the Linux Foundation.
  2. It was clear the community wanted to get back to basics of improving efficiency, operability, and performance. I’m happy to say that’s exactly what we got with Valkey 8.0, 8.1, and 9.0.
  3. Our cloud provider, AWS, was on board.

As soon as we saw that there was a much better alignment, it was simply a question of when and not if we’d switch.

Q3. Migrating a critical infrastructure component like your in-memory data store in a production environment serving millions of users is no small undertaking. Can you share the story of Nextdoor’s upgrade journey from Redis to Valkey on open-source software? 
What were the biggest challenges you encountered, how did you approach testing and validation, and what strategies did you employ to ensure zero or minimal disruption to your users during the migration?

[Slava]
I would categorize our migration as relatively mundane, but I do want to contextualize that, because it's the culmination of decisions and actions over years. First and foremost, a lot of the credit goes to the ElastiCache control plane for handling the operational aspects of the migration process. However, as a customer, it's obviously our responsibility to a) ensure compatibility with our applications and b) follow best practices.
As I mentioned earlier, we not only standardize what datastores we have but also the way in which we configure them, use them, and access them. However, there was a time when our applications didn’t handle Redis maintenance operations like upgrades or scaling activity very well. Instead of shying away from doing those activities, we invested a significant amount of time and effort in the following three areas:

  1. Simplifying our configuration by centralizing cluster access through Envoy Proxy, which allowed us to manage only one Redis cluster client configuration.
  2. Painstakingly testing and tuning how the Redis clients in our applications behaved under different scenarios such as retries, timeouts, and request errors.
  3. Ensuring that routine upgrades were simply part of continued investment in using Redis for caching.
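For the first of those investments, a centralized Envoy front for the cluster might look roughly like the listener fragment below, using Envoy's redis_proxy network filter. This is an illustrative sketch, not Nextdoor's actual configuration; the listener address, timeout, and cluster name are all assumed values:

```yaml
# Illustrative Envoy listener fragment: applications connect to one local
# endpoint, and Envoy routes requests to the Valkey cluster, so only a
# single client configuration has to be managed across applications.
listeners:
- name: valkey_listener
  address:
    socket_address: { address: 127.0.0.1, port_value: 6379 }
  filter_chains:
  - filters:
    - name: envoy.filters.network.redis_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
        stat_prefix: valkey_proxy
        settings:
          op_timeout: 0.25s            # per-operation timeout enforced at the proxy
        prefix_routes:
          catch_all_route:
            cluster: valkey_cluster    # upstream cluster defined elsewhere in the config
```

Centralizing at the proxy also means retry and timeout behavior can be tuned in one place rather than per application, which is what makes the second investment above tractable.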

As an aside, I really wish Valkey Glide had existed years ago, because it would have made our lives so much easier. When we were ironing out pain points in client behavior, client libraries lacked built-in backoff and jitter, which made it difficult to avoid thundering herds and retry storms during brief events like primary failovers.
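The backoff-and-jitter behavior that was missing from those older clients can be sketched in a few lines; this is the standard "full jitter" exponential backoff pattern, and the base delay and cap below are illustrative defaults, not values from the interview:

```python
import random

def backoff_delay(attempt: int, base: float = 0.05, cap: float = 2.0) -> float:
    """Full-jitter exponential backoff: delay in seconds before retry
    number `attempt` (0-indexed)."""
    exp = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return random.uniform(0.0, exp)        # jitter spreads clients apart

# Without jitter, every client that observed the same failover retries at
# the same instant (a retry storm); full jitter spreads retries over [0, exp).
```

The cap matters as much as the jitter: it keeps worst-case retry latency bounded during a brief primary failover instead of letting delays grow without limit.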


In any case, through our continued investment over the years, the migration from Redis to Valkey was treated as an ordinary upgrade. For us this meant running through CI, staging, and a phased rollout across different geographies.


Nearly all our clusters were successfully migrated without much fanfare or impact to the availability or stability of Nextdoor. However, despite all of our investment in standardization and operational excellence, one application escaped our best practices.


This miss subsequently provided great organizational learnings about what went wrong and why it wasn't caught earlier. The short story is that we had an application with an outdated client library that was wire-incompatible with clustered Redis beyond 5.x. This was simply a hidden sharp edge waiting to be stepped on, independent of our migration. Luckily, we could fall back to a previous backup, as this particular cluster and application were a view on top of our data lake.

Q4. Now that you’ve been running Valkey in production, what tangible benefits have you observed? Beyond the expected performance metrics, have you seen improvements in operational efficiency, cost optimization, developer experience, or community engagement? Were there any unexpected advantages or insights that emerged after making the switch?

[Slava]
With the memory optimizations that landed in Valkey 8.0 and 8.1, we observed about an 8% reduction in memory usage across our clusters. Our use cases are primarily memory bound, which means this near-double-digit reduction is a very real and very significant change to the opex of our in-memory data storage. It's not every day that your data store basically hands you free money.
Those are the tangible benefits we saw, but I also want to call out all the other efficiency and operability improvements the community landed. There's definitely something to be said when your data store handles upgrades more quickly and safely, and stays efficient and steady under load that might previously have overwhelmed it.

Q5. From Amazon’s perspective, what makes ElastiCache uniquely suited as a managed platform for running Valkey, and how does this align with what Nextdoor experienced?
Can you both speak to the advantages of running Valkey on a fully managed service versus self-managing open-source deployments—considering factors like operational overhead, scalability, security, compliance, and the ability to focus engineering resources on core product innovation rather than infrastructure management? 

[Meet] Our goal is to make Amazon ElastiCache the best place to run Valkey. We do this through a number of key innovations from recent years. First and foremost, you can run Valkey on ElastiCache Serverless, which eliminates the need to manage infrastructure, lets you launch a cache in seconds, and scales automatically based on workload demand without impact to your application. Whether your traffic spikes during peak hours or tapers off at night, ElastiCache Serverless adjusts seamlessly, so you only pay for what you use. Specifically, with Valkey on ElastiCache Serverless, you can scale from 0 to 5 million requests per second (RPS) in just a few minutes, doubling the supported RPS every 2–3 minutes. We also lowered the per-GB and per-ECPU price, reduced the storage minimum from 1 GB to 100 MB, and priced ElastiCache Serverless for Valkey 33% lower than its Redis OSS counterpart. Similarly, for node-based ElastiCache, we've priced Valkey 20% lower than its Redis OSS counterpart.

Beyond cost and elasticity, ElastiCache with Valkey delivers microsecond latency and supports millions of ops/sec with high availability of 99.99% built-in. You get enterprise-grade performance without managing clusters, patching software, or worrying about failover—AWS handles it for you. Since Valkey is a drop-in replacement for Redis, most developers can adopt it with zero code changes. This means you get open-source flexibility backed by AWS’s operational excellence. Finally, running Valkey on ElastiCache gives you deep integration with the AWS ecosystem—IAM for access control, CloudWatch for observability, and encryption at rest and in transit for security. 

You don't have to take it from me, though. Customers across industries are already seeing measurable wins after migrating to ElastiCache for Valkey. Rapid7 reported a 40% drop in latency and over 20% savings on costs, while Swiggy achieved a 40% reduction in costs while maintaining, and in some cases even enhancing, their performance. These customers were able to upgrade to Valkey without impact to their applications using the built-in in-place upgrade workflows. Such examples highlight how Valkey on ElastiCache delivers what many teams want most: better performance, lower cost, and fewer operational surprises.

[Slava] I personally believe it's important that platform teams and SREs are well aligned with the organizations they support. Delegating to ElastiCache provides much higher leverage, because the ElastiCache team is constantly investing in improving the operability of Valkey, in both the ElastiCache control plane and Valkey itself. They're knowledgeable, have been exposed to countless edge cases in the datastore, and can quickly drive fixes with the Valkey community when needed. Yes, you can certainly do it on your own, but unless your company provides in-memory data storage as a service, there's definitely something better you can be doing to support your product engineering organizations.

Q6. Anything else you wish to add?

[Slava]
I appreciate that migrations and upgrades can seem risky, uncertain, and scary but they don’t have to be if operational excellence, in whatever way that means in your organization, is applied. I’m not saying everything will go perfectly but it’s an opportunity for learning and improving the status quo.

[Meet] If you are not already using ElastiCache, you can get started with ElastiCache for free by creating a cache in less than a minute using the console, the ElastiCache APIs, or the ElastiCache MCP server. If you are already using ElastiCache and are interested in upgrading from ElastiCache for Redis to ElastiCache for Valkey, you can upgrade in-place using the cross-engine upgrade functionality. For all other topics, you can also visit our ElastiCache product page and developer guide to learn more.

……………………………………………………

Slava Markeyev is a Software Engineer at Nextdoor. His motto is “There’s nothing that you cannot learn that cannot be learned. There is nothing you cannot do that cannot be done.”

Meet Bhagdev is currently the head of product management for Amazon ElastiCache and Amazon MemoryDB at Amazon Web Services. He brings over a decade of experience in cloud database and analytics services, and is passionate about building scalable, developer-friendly database systems for mission-critical cloud workloads.

Sponsored by AWS 
