Q1. What are the main lessons you have learned from managing large Hadoop clusters at scale?
Several key lessons stand out.
Lesson one: I recommend buying the right tools up front to automate processes, eliminate manual scripting and reduce dependency on skilled programmers. Don’t plan on scripting and maintaining that code indefinitely. You can greatly increase efficiency and agility by using applications and platforms that generate, regenerate, and maintain the necessary data structures.
Lesson two: Enterprises should use different zones for different workloads to isolate HDFS- or AWS S3-based analytics from the transformation of data that will be fed into a data warehouse. At Attunity, we work with many organizations that are creating pipelines that prepare data for analytics across multiple zones within the data lake.
An example of this is a large automotive parts supplier that is realizing success by transforming data in sequential AWS S3-based buckets before analyzing it in Vertica and Amazon Redshift.
Lesson three: Data governance is of paramount importance, as evidenced by the recent headlines regarding Facebook and Cambridge Analytica, as well as the General Data Protection Regulation (GDPR) for citizens of the European Union. Attunity finds the most successful enterprises are creating the right policies and processes at the outset to cleanse, organize and secure data, in particular any data related to Personally Identifiable Information (PII).
Q2. What is the value of streaming data ingest with Kafka?
Attunity works with a number of enterprises that are very focused on this. Apache Kafka creates compelling opportunities to capitalize on the perishable value of data. By creating a message stream of live database transactions, our customers can support a variety of real-time analytics use cases, such as location-based retail offers, predictive maintenance and fraud detection. Kafka provides an efficient, high-performance platform for feeding analytics engines such as Apache Storm and Spark Streaming to support these use cases. High volumes of messages, carrying real-time updates from databases, IoT sensors and other sources, can be reliably produced, persisted and replayed in ordered sequence. The system is also flexible: message producers and consumers operate independently of one another, so endpoints can be added or dropped as needed.
We also see many organizations implementing Kafka to reliably and efficiently divide database transaction streams among multiple big data targets. For example, some companies will have different data lake zones subscribe to different message topics, with each topic assigned to a distinct ERP database. In these cases, Kafka acts like a railway switching system, directing different train cars to different destinations.
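The "railway switching" pattern above can be sketched with a toy in-memory broker. Everything here (the `MiniBroker` class, the topic and zone names) is invented for illustration; it is not the Kafka client API or anything from Attunity, just the topic-per-source, zone-per-subscription idea in miniature:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory stand-in for a message broker: each topic is an append-only list."""
    def __init__(self):
        self.topics = defaultdict(list)
        self.subscriptions = defaultdict(list)  # topic -> list of subscribing zones

    def subscribe(self, zone, topic):
        self.subscriptions[topic].append(zone)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def deliver(self):
        """Route each topic's messages to every zone subscribed to that topic."""
        zones = defaultdict(list)
        for topic, messages in self.topics.items():
            for zone in self.subscriptions[topic]:
                zones[zone].extend(messages)
        return zones

broker = MiniBroker()
# One topic per source ERP database; each data-lake zone subscribes only to what it needs.
broker.subscribe("raw-zone",     "erp_orders")
broker.subscribe("raw-zone",     "erp_inventory")
broker.subscribe("finance-zone", "erp_orders")

broker.publish("erp_orders",    {"op": "INSERT", "order_id": 42})
broker.publish("erp_inventory", {"op": "UPDATE", "sku": "A-7", "qty": 310})

delivered = broker.deliver()
print(sorted(delivered))           # ['finance-zone', 'raw-zone']
print(len(delivered["raw-zone"]))  # 2
```

The point of the sketch: producers never address zones directly; adding a new analytics zone is just a new subscription, which is what makes the switching-yard analogy apt.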
Kafka can even serve as a new system of record because messages are persisted. Ben Stopford of Confluent makes an interesting observation in his book Designing Event-Driven Systems that "a messaging system optimized to hold datasets [might] be more appropriate than a database optimized to publish them." Data subscribers can consume certain message subsets, drop off, and come back for more as needed.
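A minimal sketch of why a persisted log supports that drop-off-and-return pattern. This is a toy stand-in, not the Kafka client API (real consumers track offsets per partition, with retention policies), but the offset mechanics are the same in spirit:

```python
class ReplayableLog:
    """Toy persisted message log: consumers track their own offsets,
    so they can drop off and resume later, or rewind to replay history."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read_from(self, offset):
        return self.messages[offset:]

log = ReplayableLog()
for txn in ["T1", "T2", "T3"]:
    log.append(txn)

offset = 0
batch = log.read_from(offset)  # consumer reads everything published so far
offset += len(batch)           # remembers where it stopped

log.append("T4")               # producer keeps writing while the consumer is away
print(log.read_from(offset))   # ['T4'] - resume exactly where we left off
print(log.read_from(0))        # ['T1', 'T2', 'T3', 'T4'] - full ordered replay
```

Because the log, not the broker, is the source of truth for position, any number of independent consumers can read the same history at their own pace.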
Q3. How do you turn databases into live feeds for streaming ingest and processing?
Attunity Replicate enables enterprises to automatically publish live transactions from databases, mainframes and other sources into message streams based on Kafka or variants such as Azure Event Hubs and MapR-ES. The solution provides an intuitive graphical interface for configuring data streams from these heterogeneous systems, then manages them at scale as they feed various targets for real-time analytics. Attunity Replicate makes any database available for any application, on an agile and as-needed basis, with little or no impact on production.
It also injects schema changes into these message streams to ensure analytics results are based on the very latest source data structures.
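One way to picture schema changes traveling in-band with the data: a toy change stream in which a DDL event evolves the target schema mid-stream, so later rows are validated against the updated structure. The event shapes below are invented for illustration and are not Replicate's actual message format:

```python
# Toy change-data stream mixing data events with a schema-change (DDL) event.
stream = [
    {"type": "data",   "table": "customers", "row": {"id": 1, "name": "Ada"}},
    {"type": "schema", "table": "customers", "add_column": "email"},
    {"type": "data",   "table": "customers",
     "row": {"id": 2, "name": "Grace", "email": "g@example.com"}},
]

schemas = {"customers": {"id", "name"}}  # target's view of the source structure
rows = []

for event in stream:
    if event["type"] == "schema":
        # Evolve the target schema in-flight, before any rows that depend on it arrive.
        schemas[event["table"]].add(event["add_column"])
    else:
        # Every row is checked against the *current* schema, never a stale one.
        assert set(event["row"]) <= schemas[event["table"]]
        rows.append(event["row"])

print(schemas["customers"] == {"id", "name", "email"})  # True
print(len(rows))                                        # 2
```

Ordering is what makes this work: because the DDL event is sequenced into the same stream as the data, consumers can never see a row shaped by a schema they have not yet learned about.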
Q4. What is your take on Apache NiFi?
Apache NiFi is an amazing solution. Attunity has partnered closely with Hortonworks so that our customers can take full advantage of the technology. This open-source system was built to automate and manage enterprise data flows in real time. With NiFi you can collect, curate, analyze and act on data, and use an intuitive drag-and-drop visual interface to orchestrate data flows between various data sources and sensors. NiFi helps enterprises address numerous big data and IoT use cases that require fast data delivery with minimal manual scripting. One example we’re seeing is customers using Apache NiFi to more efficiently run sales and marketing analytics, optimize pricing strategies, predict fraud and identify security threats.
Customers are also using Attunity Replicate to accelerate data movement to NiFi architectures without the need for manual coding. Working together, Attunity Replicate and NiFi create a transformational architecture, as outlined in a book we published with Hortonworks entitled Apache NiFi for Dummies and the accompanying Industry Solution Guide for Using Apache NiFi and Attunity Replicate. What are the possibilities? Network and IT security teams can use big data tools to monitor networks and employee access behavior to protect against security exploits in real time. And business analysts can track social network buzz by monitoring feeds from Twitter, Facebook and other sites for sentiment analysis.
Q5. What are the pros and cons of a data lake?
Many enterprises continue to invest resources in data lakes because they provide a highly scalable, efficient and cost-effective platform for storing and processing high volumes of data from a variety of sources.
Data lakes can support innovative analytics engines such as Spark and very efficiently transform data for focused, structured analysis in traditional data warehouses. The downside of the data lake really centers on its complexity.
We see enterprise IT organizations struggle with specialized manual scripting procedures. To reduce this problem, many data architects who grew up on data warehouses and SQL are employing SQL-like layers such as Apache Hive on top of Hadoop and using Attunity Compose to accelerate data warehouse design, development, testing, deployment and updates automatically.
Q6. What is your take on the so-called Lambda architecture?
Lambda has gained traction among enterprises. This is an architecture that applies both batch and real-time processing to a given high-volume dataset to meet latency, throughput, and fault-tolerance requirements while combining comprehensive batch data views with online data views. Technology components vary, but typically Apache Kafka or alternatives like Amazon Kinesis send messages to stream-processing platforms like Storm or Spark Streaming, which in turn feed repositories such as Cassandra or HBase.
The batch layer is often MapReduce on HDFS, which might feed engines such as ElephantDB or Impala. Often queries are answered with merged results from both the batch and real-time engines.
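The merged-query idea can be illustrated with a toy serving layer that combines a precomputed batch view with a real-time delta from the speed layer. The table names and counts are invented; real Lambda deployments would back these views with stores like HBase or Cassandra:

```python
# Batch layer: a precomputed view over all historical data (rebuilt periodically).
batch_view = {"page_a": 1000, "page_b": 250}

# Speed layer: incremental counts for events that arrived after the last batch run.
realtime_view = {"page_a": 7, "page_c": 3}

def query(key):
    """Serving layer: answer queries by merging the batch view with the real-time delta."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 1007 (historical total plus recent events)
print(query("page_c"))  # 3    (seen only since the last batch run)
```

The design trade-off this sketch exposes: the speed layer must only hold data newer than the last batch rebuild, and clearing it at exactly the right moment is one of the operational complexities Lambda is known for.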
However, Lambda architectures struggle to handle massive data streams in real time and multiple data models from multiple sources. To better meet these requirements, we see many enterprises considering an alternative architecture called SMACK, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The Spark component supports both batch and stream data processing in the same application at the same time. SMACK persists data in Cassandra, which guarantees access to historical data and the ability to replay data in the event of an error. Attunity Replicate with change data capture (CDC) technology plays a vital role here, as it can both publish real-time streams to the data-in-motion infrastructure and write directly to the data-at-rest repositories.
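The dual role of CDC described above, feeding both data-in-motion and data-at-rest, can be sketched as a toy dual-write. This is purely illustrative (the function and keys are invented, and Replicate's actual delivery mechanics differ), but it shows why the two destinations serve different questions:

```python
def capture_change(event, stream, store):
    """Toy CDC fan-out: publish the change to the data-in-motion stream
    and apply it to the data-at-rest store in one step."""
    stream.append(event)                  # ordered history, for stream consumers
    store[event["key"]] = event["value"]  # latest state, for at-rest queries

stream, store = [], {}
capture_change({"key": "sku-7", "value": 310}, stream, store)
capture_change({"key": "sku-7", "value": 280}, stream, store)

print(len(stream))     # 2 - the full ordered history is available for replay
print(store["sku-7"])  # 280 - only the current state is kept at rest
```

Stream consumers see every intermediate change in order, while the at-rest repository answers "what is the value now" without replaying history.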
Jordan Martz is the Director of Technology Solutions at Attunity, a leading provider of data integration and data management software. In this role, Jordan works closely with both the alliances and product management teams at Attunity on data lake, streaming, IoT and cloud solutions.