On Apache Kafka®. Q&A with Gwen Shapira
Q1. What are the most important changes to Kafka in the last few years?
The last few years have been an incredibly busy time for the Apache Kafka® community. The community grew quite rapidly. We currently have more than 19,500 members in Apache Kafka-related meetups, over 1,500 pull requests committed in 2017 and a 160 percent increase in Kafka improvement proposals: 76 significant improvements suggested, accepted and implemented in 2017.
With increasing adoption of Apache Kafka by large enterprise companies, we saw the requirements for Apache Kafka shift toward providing more security, reliability, usability, data integration and advanced data processing. To meet these new requirements, the Apache Kafka community added major functionality in the last few years:
- Security: Authentication, authorization and encryption capabilities were added in 2015.
- Data Integration: The Kafka Connect API and a high-level interface for data integration were added toward the end of 2015, followed by rapid contribution of connector plugins by the community. Today, there are over 50 open source connectors, integrating Apache Kafka with all the common databases and big data stores including Oracle, MySQL, MongoDB, Cassandra, Elasticsearch, HDFS, S3 and many more.
- Stream processing: Kafka’s Streams API, which provides advanced data processing capabilities, was added in early 2016, opening the door for new stream processing use-cases such as real-time data enrichment, recommendation systems and real-time price alerts.
- Reliability: In early 2017, we introduced Exactly Once Semantics. This includes an idempotent producer, which eliminates duplicates caused by client retries, and more importantly, it introduces transactional capabilities. This major improvement allowed financial institutions to adopt Apache Kafka for mission-critical use cases.
- Usability: In mid-2017, Confluent introduced KSQL, a SQL-like stream processing layer for Apache Kafka. While this is still in early adoption stages, we hope that this improvement will make stream processing accessible to the large number of existing developers who are familiar with SQL.
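To make the idempotent producer idea from the Reliability item more concrete, here is a minimal sketch in Python. It is an illustration only, not Kafka's actual implementation or API: `ToyBroker`, `append` and the producer ids are hypothetical names, and the sketch assumes the broker tracks the highest sequence number it has accepted from each producer and ignores anything it has already seen.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBroker:
    """Hypothetical in-memory log for a single partition (not a Kafka API)."""

    log: list = field(default_factory=list)
    # Highest sequence number accepted so far, per producer id.
    last_seq: dict = field(default_factory=dict)

    def append(self, producer_id: str, seq: int, value: str) -> bool:
        """Append value unless this producer already sent this sequence number."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate caused by a retry: do not append again
        self.last_seq[producer_id] = seq
        self.log.append(value)
        return True


broker = ToyBroker()
broker.append("producer-1", 0, "order:book")
broker.append("producer-1", 1, "payment:book")
# The client never saw the acknowledgement, so it retries the same record.
retry_accepted = broker.append("producer-1", 1, "payment:book")
```

In this toy model the retry is recognized by its sequence number, so the log still contains each record once even though the client sent one of them twice.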
Q2. Is it better to use Kafka or Flume to load data to Hadoop clusters?
Starting as early as 2014, this was a common question from customers. And back then, it was a difficult question. Kafka had much better reliability guarantees and performance.
Flume, on the other hand, had better integration with Hadoop and was much easier to configure for this use case. The fact that each technology brought different critical capabilities to the table caused me to develop Flafka (Kafka integration inside Flume). This was 3-4 years ago. Now, with the addition of the Kafka Connect API and the open source connectors, Kafka has reliability, performance and great Hadoop integration. The engineers at Pandora and many other organizations agree with me: this is one of the most common Kafka use cases.
It is important to remember that getting data to Hadoop is only one of the many uses of Apache Kafka. Apache Kafka also supports messaging, microservices, stream processing and data integration for 50+ databases.
Q3. Last year Confluent announced KSQL, a Streaming SQL Engine for Apache Kafka. How does it differ from other existing SQL extensions for Data Analytics?
We can look at the existing SQL for Data Analytics offerings as belonging to three broad categories:
- Batch ETL: The best-known example in this category is Apache Hive. These engines are built to process many terabytes of data over the course of several hours and are typically used in large clusters to pre-process all the data accumulated over a long period of time into formats and structures that match the reporting needs of the organization.
- Interactive data exploration: Presto and Impala are good examples of SQL engines in this category. These engines are built to allow data analysts to slice and dice existing data in order to gain insights into the data as it exists at that particular moment in time.
- Stream processing: KSQL fits into this category. These engines are built to process data continuously, in real time, as it is being collected. They can be used for real-time ETL, to power real-time dashboards or even to build real-time applications, such as fraud detection. Even though stream processing SQL engines can be used for ETL and data analysis, they are different from the previous two categories because they are not built for ad-hoc exploration of existing data, nor are they optimized for periodic processing of large amounts of accumulated data. Rather, they query the data as it is collected.
Because the use of SQL for stream processing is a rather new field, there are differences between the various engines that can be significant in practice. In my opinion, the most important differences are whether the streaming engine has the right high-level abstractions to make the SQL-like language expressive enough for your applications, and whether the underlying engine was built for stream processing. Let me explain what I mean with two examples:
- KSQL has high-level abstractions for both a stream of events and a table representing state at a specific point in time. For example, if I have an online store, and a user ordered a book and later returned it, a stream abstraction will show two actions, order and return, while a table abstraction will show no change in the state of the inventory after these two actions took place. KSQL allows modeling the same underlying Kafka topic as either a stream of events or a state table. A user can choose whichever abstraction best matches their use case: tables are often used to represent slowly changing dimensions, which are used to enrich the data in the stream of events.
- Because stream processing queries are meant to run continuously and process all incoming data as it arrives, it is important that the underlying engine allows dynamic scaling of a query, adding and removing processing resources while it is running. Some stream processing engines show their batch origins and don’t allow for this dynamic scaling, which makes it much more difficult or more expensive to work with fluctuations in the rate at which data arrives.
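The stream and table abstractions from the online store example can be sketched in a few lines of Python. This is only an analogy and does not use KSQL: the event format, the starting stock of 10 and the function names `as_stream` and `as_table` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical events from one topic: (item, change in stock).
events = [
    ("book", -1),  # a user orders the book
    ("book", +1),  # the user later returns it
]


def as_stream(events):
    """Stream view: every event is kept, in arrival order."""
    return list(events)


def as_table(events, initial_stock):
    """Table view: fold the events into the current state per key."""
    state = defaultdict(int, initial_stock)
    for item, delta in events:
        state[item] += delta
    return dict(state)


initial = {"book": 10}
stream_view = as_stream(events)          # two separate actions
table_view = as_table(events, initial)   # stock after both actions
```

The stream view still holds both the order and the return, while the table view ends with the same stock it started with, which is the difference the example describes.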
So, while it looks like there is a large collection of mostly identical SQL engines out there, the differences are quite profound and it is important to match the choice of engine to the planned use case.
Q4. KSQL is part of Confluent Open Source and licensed under the Apache 2.0 license. Who is currently using it and for what?
The most common use case of KSQL, by far, is real-time ETL. This can be loading log files to Kafka, using KSQL to remove unnecessary fields and then loading only the important events to Splunk or Elasticsearch. Or loading user profiles from a relational database to Kafka, joining them with user activity events reported to Kafka by a mobile application, aggregating levels of user-activity per hour for each user and loading the results to Grafana for exploration by the business analysts.
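As a rough sketch of the second pipeline, here is the join-and-aggregate logic in plain Python. It assumes a small, finite list of events, whereas a KSQL query would run continuously over a Kafka topic; the profile fields, event tuples and the function name `hourly_activity` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical user profiles, as if loaded from a relational database.
profiles = {"u1": {"country": "US"}, "u2": {"country": "DE"}}

# Hypothetical activity events: (user id, epoch seconds, action).
activity = [
    ("u1", 1514764800, "click"),   # 2018-01-01 00:00:00 UTC
    ("u1", 1514766600, "scroll"),  # 2018-01-01 00:30:00 UTC
    ("u2", 1514768400, "click"),   # 2018-01-01 01:00:00 UTC
    ("u3", 1514768401, "click"),   # no matching profile
]


def hourly_activity(activity, profiles):
    """Join events with profiles and count events per user per hour."""
    counts = Counter()
    for user_id, ts, _action in activity:
        profile = profiles.get(user_id)
        if profile is None:
            continue  # inner join: skip events without a profile
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:00Z")
        counts[(user_id, profile["country"], hour)] += 1
    return dict(counts)


result = hourly_activity(activity, profiles)
```

Each key in `result` is a user, an attribute taken from the profile and an hourly window, which is roughly the shape of data you would then load into a dashboard tool.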
Data exploration is also a popular use case. While Kafka isn’t optimized for ad-hoc queries, at the beginning of a data analysis project there aren’t huge amounts of data yet and it all fits into memory. Being able to query the data and explore the different variations of events can help kick-start the project.
Q5. What is the road map ahead for Kafka?
One of the things that makes Kafka such a great open source project is that everyone can contribute improvements and make suggestions. The Kafka community has a Kafka Improvement Proposal process where the entire community discusses the improvement suggestions and votes on the proposals. We can see all the improvement proposals and the discussions in the Kafka wiki.
Looking at these proposals, we can see several exciting improvements in the near-term roadmap: faster startup times, security improvements such as OAuth authentication, better scalability with a proposal to support dynamic changes in topic partitions and many more API updates.
In order to try to predict longer-term improvements, we can look at the trends we are seeing in the community and the industry. Kafka is rapidly moving past the early-adopter stage into business-critical use cases in more traditional industries. This changes the priorities for the community, as Kafka now has to run for a wide variety of use cases, at scale, by teams that are not necessarily Kafka experts. We are focused on making Kafka ever more secure and reliable, and we want to make sure it is easy to scale globally across many different teams inside the organization. We also want to make sure it integrates well with all existing enterprise solutions. And as regulatory requirements become more complex, especially around privacy, we want to make sure Kafka meets the compliance requirements needed to really be the central data system for large organizations.
Of course, many companies choose to avoid running their own software. Maybe the biggest trend of recent years has been the adoption of cloud solutions in nearly every industry.
In order to help with this transition, Confluent runs Kafka as a service in the cloud, allowing developers to focus on their use cases and on creating business value with streaming data rather than on running Kafka itself.
Gwen Shapira is a principal data architect at Confluent helping customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures as well as integrate microservices, relational and big data technologies. She currently specializes in building real-time, reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka – the Definitive Guide,” “Hadoop Application Architectures,” and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects. When Gwen isn’t coding or building data pipelines, you can find her pedaling on her bike exploring the roads and trails of California, and beyond.