On Streaming Analytics. Q&A with Saurabh Dutta
Q1. What are the technical challenges in building and deploying streaming analytics applications?
Building and deploying streaming analytics applications comes with its own set of challenges. Some of the key ones include:
1. Choosing from a gamut of technologies – There are a huge number of options and technologies available for building a streaming application. The challenge is to choose the right set of technologies based on the needs of the application.
2. Integration with existing systems – Large organizations already have solutions and infrastructure in place, so a new streaming analytics application has to integrate with those existing systems and fit well alongside them.
3. Testing and deployment – As most streaming analytics applications run in a distributed fashion, it is difficult to test them for accuracy and data consistency. This makes thorough testing before production all the more important, and promoting and deploying them to the production environment should require minimal human intervention.
4. Monitoring applications – Once the application is up and running in production, it has to be continuously monitored for performance and uptime.
5. Handling failover scenarios – Your application components and infrastructure are bound to go down at some point. So, while building these applications you should handle these scenarios, and after deployment you should have checks and balances in place to detect failures and recover from them.
Q2. What is the most effective way to use Apache Spark Streaming and Apache Storm for analytics of streaming data?
Apache Spark Streaming and Apache Storm are two of the most popular streaming frameworks, and each has its own applications. While Apache Storm focuses on processing individual events as they occur, Apache Spark focuses on micro-batches. With its latest releases, Apache Spark has closed many gaps by introducing concepts like continuous processing. So, in use cases where even a small processing delay is unacceptable, Apache Storm should be considered, while in cases where a bit of latency is okay, Apache Spark is the best fit. Apache Spark today is also the most popular streaming framework due to its rich feature set, which includes advanced analytics, SQL query interfaces and support for both batch and streaming applications.
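The difference between the two processing models can be illustrated with a minimal, framework-free Python sketch. It uses neither the Storm nor the Spark API; the function names are hypothetical, and batches are cut by event count purely to keep the example small, whereas a micro-batch engine typically cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_per_event(events: Iterable[T], handler: Callable[[T], R]) -> Iterator[R]:
    """Event-at-a-time model: every event is handled as soon as it arrives."""
    for event in events:
        yield handler(event)


def process_micro_batches(
    events: Iterable[T],
    batch_size: int,
    batch_handler: Callable[[List[T]], R],
) -> Iterator[R]:
    """Micro-batch model: events are buffered and handled as small groups.

    Assumption: a batch closes after `batch_size` events. This stands in
    for the time-based batch interval of a real micro-batch engine.
    """
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch_handler(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch_handler(batch)
```

In the first function a result is available after each event, while in the second the earliest event of a batch waits until the batch closes, which is where the extra latency of micro-batching comes from.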
Q3. How do you handle multiple streams from different sources of data?
In almost any use case you’ll encounter situations where you have to join data from multiple sources. In most cases the challenges come from differences in data formats and data velocity, inconsistent data attributes, and the accuracy and integrity of the data itself. Therefore, when handling multiple streams you should focus on four things: converging the data from these streams into a uniform format, watermarking to handle late-arriving events, metadata stores that hold the entity definitions, and data quality checks.
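As a rough illustration of the first two points, the following Python sketch converts records from two sources into one uniform event type and applies a simple watermark to reject events that arrive too late. Everything here is an assumption made for the example: the record layouts of "source A" and "source B", the `Event` fields, and the `Watermark` class are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Uniform format that every source is converted into."""
    entity_id: str
    event_time: float  # seconds since epoch
    value: float


def from_source_a(record: dict) -> Event:
    # Hypothetical source A: dicts with millisecond timestamps.
    return Event(str(record["id"]), record["ts_ms"] / 1000.0, float(record["amount"]))


def from_source_b(line: str) -> Event:
    # Hypothetical source B: CSV lines "entity,epoch_seconds,value".
    entity, ts, value = line.split(",")
    return Event(entity.strip(), float(ts), float(value))


class Watermark:
    """Tracks the highest event time seen and rejects events that fall
    more than `allowed_lateness` seconds behind it."""

    def __init__(self, allowed_lateness: float) -> None:
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")

    def accept(self, event: Event) -> bool:
        watermark = self.max_event_time - self.allowed_lateness
        if event.event_time < watermark:
            return False  # too late: drop it or route it to a side channel
        self.max_event_time = max(self.max_event_time, event.event_time)
        return True
```

Once both sources produce `Event` objects, a single watermark can be applied to the merged stream regardless of where each event came from.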
Q4. Can you perform Machine Learning dynamically on Data Streams? If yes, how?
Yes, it is possible to perform machine learning dynamically on data streams. To achieve this, you have to implement an architecture that supports both batch and real-time pipelines. The batch pipelines should be used to train the models, while the real-time pipelines should be used for scoring. To keep the learning dynamic, every time the batch pipeline is triggered with an updated dataset and trains a model, the previous model should be automatically replaced by the newly trained one. In this way, your predictions evolve and become more accurate over time.
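The automatic replacement step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ModelRegistry` and `train_mean_model` are hypothetical names, and the "training" is a toy stand-in (it just learns the mean of the history) for whatever the batch pipeline really produces.

```python
import threading
from typing import Callable, Sequence

Model = Callable[[float], float]


def train_mean_model(history: Sequence[float]) -> Model:
    """Toy batch 'training': the model scores how far a value
    lies from the mean of the training data."""
    mean = sum(history) / len(history)
    return lambda x: x - mean


class ModelRegistry:
    """Holds the model used by the real-time pipeline and lets the
    batch pipeline swap in a newly trained one."""

    def __init__(self, initial: Model) -> None:
        self._lock = threading.Lock()
        self._model = initial
        self._version = 1

    @property
    def version(self) -> int:
        with self._lock:
            return self._version

    def publish(self, model: Model) -> None:
        # Called by the batch pipeline after each training run.
        with self._lock:
            self._model = model
            self._version += 1

    def score(self, x: float) -> float:
        # Called by the real-time pipeline for every incoming event.
        with self._lock:
            model = self._model
        return model(x)
```

The real-time pipeline only ever calls `score`, so it picks up a new model on the next event after `publish` returns, without being restarted.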
————
Saurabh Dutta, Product Manager, Impetus Technologies
Saurabh leads multiple engineering and R&D efforts for new and upcoming features in StreamAnalytix. He is one of the early team members who bootstrapped the StreamAnalytix product. His areas of expertise include big data, advanced analytics and cloud computing. He is responsible for analyzing customers’ business challenges and creating generic solutions that fit across industries and domains.