Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown.
“Our experience with MongoDB is that its architecture requires some nodes to be declared as Masters and others as Slaves, and the configuration is complex and unintuitive. C* architecture is much simpler: every node is a peer in a ring, and replication is handled internally by Cassandra based on your desired redundancy level. There is much less manual intervention, which allows us to easily automate many tasks when managing a C* cluster.” — Christos Kalantzis and Jason Brown.
Netflix, Inc. (NASDAQ: NFLX) is an online DVD and Blu-Ray movie retailer offering streaming movies through video game consoles, Apple TV, TiVo and more.
Last year, Netflix had a total of 29.4 million subscribers worldwide for its streaming service (Source).
I have interviewed Christos Kalantzis, Engineering Manager – Cloud Persistence Engineering, and Jason Brown, Senior Software Engineer, both at Netflix. They were involved in deploying Cassandra in a production EC2 environment at Netflix.
RVZ
Q1. What are the main technical challenges that Big data analytics pose to modern Data Centers?
Kalantzis, Brown: As companies are learning how to extract value from the data they already have, they are also identifying all the value they could be getting from data they are currently not collecting.
This is creating an appetite for more data collection. This appetite for more data is pushing the boundaries of traditional RDBMS systems and forcing companies to research alternative data stores.
This new data volume also requires companies to think about the extra costs involved in storing it (space/power/hardware/redundant copies).
Q2. How do you handle large volumes of data (terabytes to petabytes)?
Kalantzis, Brown: Netflix does not have its own datacenter. We store all of our data and applications on Amazon’s AWS. This allows us to focus on creating really good applications without the overhead of thinking about how we are going to architect the Datacenter to hold all of this information.
Economies of scale also allow us to negotiate a good price with Amazon.
Q3. Why did you choose Apache Cassandra (C*)?
Kalantzis, Brown: There are several reasons we selected Cassandra. First, as Netflix is growing internationally, a solid multi-datacenter story is important to us. Configurable replication and consistency, as well as resiliency in the face of failure, are absolute requirements, and we have tested those capabilities more than once in production! Other compelling qualities include being an open source, Apache project and having an active and vibrant user community.
Q4: What do you exactly mean with “a solid multi-datacenter story is important to us”? Please explain.
Kalantzis, Brown: As we expand internationally we are moving and standing up new datacenters close to our new customers. In many cases we need a copy of the full dataset of an application. It is important that our Database Product be able to replicate across multiple datacenters, reliably, efficiently and with very little lag.
Q5. Tell us a little bit about the applications powered by C*.
Kalantzis, Brown: Almost everything we run in the cloud (which is almost the entire Netflix infrastructure) uses C* as a database. From customer/subscriber information, to movie metadata, to monitoring stats, it’s all hosted in Cassandra.
In most of our uses, Cassandra is the source-of-truth database. There are a few legacy datasets in other solutions, but they are actively being migrated.
Q6: What are the typical data insights you obtained by analyzing all of these data? Please give some examples. How do you technically analyze the data? And by the way, how large are your data sets?
Kalantzis, Brown: All the data Netflix gathers goes towards improving the customer experience. We analyze our data to understand viewing preferences, give great recommendations and make appropriate choices when buying new content.
Our BI team has done a great job with the Hadoop platform and has been able to extract the information we need from the terabytes of data we capture and store.
Q7. What availability expectations do your customers have?
Kalantzis, Brown: Our internal teams are incredibly demanding on every part of the infrastructure, and databases are certainly no exception. Thus a database solution must have low latency, high throughput, strict uptime/availability requirements, and be scalable to massive amounts of data. The solution must, of course, be able to withstand failure and not fall over.
Q8: Be scalable up to?
Kalantzis, Brown: We don’t have an upper boundary to how much data an application can store. That being said, we also expect the application designers to be intelligent about how much data they need readily available in their OLTP system. Our applications store anywhere from 10 GB to 100 terabytes [Edit: corrected a typo] of data in their respective C* clusters. C* architecture is such that the cluster’s capacity grows linearly with every node added to the cluster. So in theory we can scale “infinitely”.
Q9. What other methods did you consider for continuous availability?
Kalantzis, Brown: We considered and experimented with MongoDB, yet the operational overhead and complexity made it unmanageable so we quickly backed away from it. One team even built a sharded RDBMS cluster, with every node in the cluster being replicated twice. This solution is also very complex to manage. We are currently working to migrate to C* for that application.
Q10. Could you please explain in a bit more detail what kind of complexity made MongoDB unmanageable for you?
Kalantzis, Brown: Netflix strives to choose architectures that are simple and require very little in the way of manual intervention to manage and scale. Our experience with MongoDB is that its architecture requires some nodes to be declared as Masters and others as Slaves, and the configuration is complex and unintuitive. C* architecture is much simpler: every node is a peer in a ring, and replication is handled internally by Cassandra based on your desired redundancy level. There is much less manual intervention, which allows us to easily automate many tasks when managing a C* cluster.
Q11. How many data centers are you replicating among?
Kalantzis, Brown: For most workloads, we use two Amazon EC2 regions. For very specific workloads, up to four EC2 regions.
To slice that further, each region has two to three availability zones (you can think of an availability zone as the closest equivalent to a traditional data center). We shard and replicate the data within a region across its availability zones, and replicate between regions.
Q12: When you shard data, don’t you have a possible data consistency problem when updating the data?
Kalantzis, Brown: Yes, when writing across multiple nodes there is always the issue of consistency. C* “solves” this by allowing the client (application) to choose the consistency level it desires when writing. You can choose to write to only one node, to all nodes, or to a quorum of nodes. Each choice offers a different level of consistency, and the application will only return from its write statement when the desired consistency is reached.
The same applies to reading: you can choose the level of consistency you desire with every statement. The timestamps of the records read are compared, ensuring that only the latest record is returned.
In the end, the application developer needs to understand the trade-offs of each setting (speed vs. consistency) and make the decision that best fits their use case.
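To make this tunable consistency concrete, here is a minimal sketch using the DataStax Java driver (not necessarily the client Netflix used at the time; their own Astyanax client is mentioned later in this interview). The keyspace, table and column names are hypothetical.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableConsistencySketch {
    public static void main(String[] args) {
        // Connect to a (hypothetical) local cluster and keyspace.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks");

        // Write at QUORUM: the call returns only after a majority of replicas acknowledge.
        SimpleStatement write = new SimpleStatement(
            "INSERT INTO viewing_position (user_id, movie_id, position_secs) VALUES (42, 7, 3610)");
        write.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(write);

        // Read at ONE: the fastest option, but it may return slightly stale data.
        SimpleStatement read = new SimpleStatement(
            "SELECT position_secs FROM viewing_position WHERE user_id = 42 AND movie_id = 7");
        read.setConsistencyLevel(ConsistencyLevel.ONE);
        ResultSet rs = session.execute(read);
        System.out.println(rs.one().getInt("position_secs"));

        cluster.close();
    }
}

Reading and writing at QUORUM gives read-your-writes behavior at the cost of latency; ONE favors speed and availability, which matches the movie-position example discussed above.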
Q13. How do you reduce the I/O bandwidth requirements for big data analytics workloads?
Kalantzis, Brown: What is great about C*, is that it allows you to scale linearly with commodity hardware. Furthermore adding more hardware is very simple and the cluster will rebalance itself. To solve the I/O issue we simply add more nodes. This reduces the amount of data stored in each node allowing us to get as close as possible to the ideal data:memory ratio of 1.
Q14. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
Kalantzis, Brown: When we chose C* we made a conscious decision to accept eventually consistent data.
We decided that write/read speed and high availability were more important than consistency. Our application is such that this trade-off, although it might rarely provide inaccurate data (for example, starting a movie at the wrong location), does not negatively impact a user. When an application does require accuracy, we increase the consistency level to quorum.
Q15. How do you ensure that your system does not become unavailable?
Kalantzis, Brown: Minimally, we run our C* clusters (over 50 in production now) across multiple availability zones within each region of EC2. If a cluster is multi-region (multi-datacenter), then one region is naturally isolated (for reads) from the other; all writes will eventually make it to all regions due to Cassandra’s cross-region replication.
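For illustration only, here is a hedged sketch of how per-datacenter replication of the kind described above is typically declared at the keyspace level, using the same DataStax Java driver as in the earlier sketch. The datacenter names ("us-east", "eu-west") and replication factors are assumptions; the real names depend on the snitch and topology in use.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class MultiRegionKeyspaceSketch {
    public static void main(String[] args) {
        // Connect without selecting a keyspace, since we are about to create one.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // NetworkTopologyStrategy pins a replica count per datacenter: here,
        // three copies in each of two regions, so either region can serve reads
        // on its own while writes propagate to the other asynchronously.
        session.execute(
            "CREATE KEYSPACE demo_ks WITH replication = "
          + "{'class': 'NetworkTopologyStrategy', 'us-east': 3, 'eu-west': 3}");

        cluster.close();
    }
}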
Q16. How do you handle information security?
Kalantzis, Brown: We currently rely on Amazon’s security features to limit who can read/write to our C* clusters. As we move forward with our cloud strategy and want to store Financial data in C* and in the cloud, we are hoping that new security features in DSE 3.0 will provide a roadmap for us. This will enable us to move even more sensitive information to the Cloud & C*.
Q17. How is your experience with virtualization technologies?
Kalantzis, Brown: Netflix runs all of its infrastructure in the cloud (AWS). We have extensive experience with virtualization and fully appreciate all of its pros and cons.
Q18. How operationally complex is it to manage a multi-datacenter environment?
Kalantzis, Brown: Netflix invested heavily in creating tools to manage multi-datacenter and multi-region computing environments. Tools such as Asgard and Priam have made that management easy and scalable.
Q19. How do you handle modularity and flexibility?
Kalantzis, Brown: At the data level, each service typically has its own Cassandra cluster so it can scale independently of other services. We are developing internal tools and processes for automatically consolidating and splitting clusters to optimize efficiency and cost. At the code level, we have built and open sourced our Java Cassandra client, Astyanax, and added many recipes for extending the use patterns of Cassandra.
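As a rough illustration of the Astyanax client mentioned above, here is a sketch reconstructed from the pattern in Astyanax’s public getting-started documentation; the exact builder and configuration calls vary between Astyanax versions, and the cluster, keyspace, column family and row/column names used here are hypothetical.

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxSketch {
    public static void main(String[] args) throws ConnectionException {
        // Build a context for a (hypothetical) cluster and keyspace.
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("DemoCluster")
            .forKeyspace("demo_ks")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("DemoPool")
                .setPort(9160)
                .setMaxConnsPerHost(1)
                .setSeeds("127.0.0.1:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        // A hypothetical column family keyed by user id, with string column names and values.
        ColumnFamily<String, String> CF_USERS = new ColumnFamily<String, String>(
            "users", StringSerializer.get(), StringSerializer.get());

        // Write a couple of columns for one row.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_USERS, "user-42")
            .putColumn("email", "someone@example.com", null)
            .putColumn("plan", "streaming", null);
        m.execute();

        // Read the row back.
        ColumnList<String> columns =
            keyspace.prepareQuery(CF_USERS).getKey("user-42").execute().getResult();
        System.out.println(columns.getColumnByName("email").getStringValue());

        context.shutdown();
    }
}

The recipes mentioned in the answer build on these same keyspace/query primitives.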
Q20. How do you reduce (if any) energy use?
Kalantzis, Brown: Currently, we do not. However, C* is introducing a new feature in version 1.2 called ‘virtual nodes’. The short explanation of virtual nodes is that they make scaling the size of a C* cluster up and down much easier to manage. Thus, we are planning on using the virtual nodes concept with EC2 auto-scaling, so we would scale up the number of nodes in a C* cluster during peak times and reduce the node count during troughs. We’ll save energy (and money) by simply using fewer resources when demand is not present.
Q21. What advice would you give to someone needing to ensure continuous availability?
Kalantzis, Brown: Availability is just one of the three dimensions of CAP (Consistency, Availability and Partition tolerance). They should evaluate which of the other two dimensions are important to them and choose their technology accordingly. C* solves for A & P (and some of C).
Q22. What are the main lessons learned at Netflix in using and deploying Cassandra in a production EC2 environment?
Kalantzis, Brown: We learned that deploying C* (or any database product) in a virtualized environment has trade-offs. You gain easy and quick access to many virtual machines, but each virtual machine has MUCH lower IOPS than a traditional server. Netflix has learned how to deal with those limitations and trade-offs by sizing clusters appropriately (choosing the right number of nodes).
It has also forced us to reimagine the DBA role as a DB Engineer role, which means we work closely with application developers to make sure that their schema and application design are as efficient as possible and do not access the data store unnecessarily.
————-
Christos Kalantzis
(@chriskalan), Engineering Manager – Cloud Persistence Engineering, Netflix
Previously Engineering Platform Manager, YouSendIt
A tech enthusiast at heart, I try to focus my efforts on creating technology that enhances our lives.
I have built and led teams at YouSendIt and Netflix, which has led to the scaling out of persistence layers, the creation of a cloud file system and the adoption of Apache Cassandra as a scalable and highly available data solution.
I’ve worked as a DB2, SQL Server and MySQL DBA for over 10 years and, through sometimes painful trial and error, I have learned the advantages and limitations of RDBMS and when the modern NoSQL solutions make sense.
I believe in sharing knowledge; that is why I am a huge advocate of open source software. I share my software experience through blogging, podcasting and mentoring new start-ups. I sit on the tech advisory board of the OpenFund Project, which is an angel VC for European start-ups.
Jason Brown
(@jasobrown), Senior Software Engineer, Netflix
Jason Brown is a Senior Software Engineer at Netflix where he led the pilot project for using and deploying Cassandra in a production EC2 environment. Lately he’s been contributing to various open source projects around the Cassandra ecosystem, including the Apache Cassandra project itself. Holding a Master’s Degree in Music Composition, Jason longs to write a second string quartet.
———-
Related Posts
– Lufthansa and Data Analytics. Interview with James Dixon. February 4, 2013
– On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012
– Big Data Analytics. Interview with Duncan Ross. November 12, 2012
##
Hi, one point of clarification: in MongoDB one doesn’t declare a master. The replica set members are peers and perform consensus elections to elect a primary at any given point in time.
So, for example, to set up a three-node replica set one would run:*
$ mongo --host host0
> cfg = {
    _id : "mysetname",
    members : [
      {_id : 0, host : "host0"},
      {_id : 1, host : "host1"},
      {_id : 2, host : "host2"}
    ]
  }
> rs.initiate(cfg)
One might declare certain nodes as being secondary-only in some multi data center topologies, or in cases where one wants an analytics workload to only hit a certain slice of the cluster to assure a high QOS for the operational side of the application.
We do get requests from MongoDB users with very large clusters for new cluster management/operational features; that is certainly a high priority on the project roadmap.
* See also http://docs.mongodb.org/manual/reference/method/rs.initiate/
Hi Dwight,
Let me preface this comment with a clarification: the intention of the article was never to do a hatchet job on MongoDB. I am sharing our experiences with the technology and how it didn’t fit our use case of multi-datacenter, multi-directional heavy writes.
That being said, I’m glad you clarified that in today’s MongoDB a master can be chosen automatically to fill that role and that a new master will then be chosen as needed, in case of failure.
However, that is still a Master-Slave scenario, where ALL writes go only to the single Master (of that shard/replica-set) and eventually get replicated to the Slaves. That architecture limits your write throughput to a single server. This subsequently limits the efficiency of your hardware usage, since adding more servers doesn’t increase both write and read throughput, unlike in C*, where adding a new node to the cluster linearly increases both write and read throughput (Reference)
“A MongoDB replica set is a cluster of mongod instances that replicate amongst one another and ensure automated failover. Most replica sets consist of two or more mongod instances with at most one of these designated as the primary and the rest as secondary members. Clients direct all writes to the primary, while the secondary members replicate from the primary asynchronously. . . .If you’re familiar with other database systems, you may think about replica sets as a more sophisticated form of traditional master-slave replication. In master-slave replication, a master node accepts writes while one or more slave nodes replicate those write operations and thus maintain data sets identical to the master. For MongoDB deployments, the member that accepts write operations is the primary, and the replicating members are secondaries.”
http://docs.mongodb.org/manual/core/replication/
Furthermore, setting up multi-directional, multi-datacenter replication with MongoDB is not supported under its single-master-per-shard/replica-set architecture. That was a very important use case for us and was a major reason for considering C*, which handles that scenario with minimal operational complexity.
Cheers,
Christos
Mongo is not without its flaws – they are numerous and substantial, and I have written about them extensively – but this is not an apples-to-apples comparison of architectures.
If you want to scale writes, you use sharding. If you want to survive failures, you use replica sets. You could argue it’s nicer to have nodes that run partitions of data which are replicated around on nodes (I certainly would), but the Mongo architecture is very simple and very easy to administer.