On Hadoop and Big Data. Interview with John Leach
“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach
I have interviewed John Leach, CTO & Cofounder Splice Machine. Main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine also contributed to the interview.
Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?
John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and ranges scan used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know what key to distribute data and the right shard size. This results in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equally. Besides columnar storage, there are many other optimizations, such as vectorization and run length encoding that can have a big impact on analytic performance. If your OLAP queries run slower, common with large joins and aggregations, this will result in poor productivity. Queries may take minutes or hours instead of seconds. On the flip-side is using columnar when you need concurrency and real-time.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
Q2. In your experience, what are the most common problems in Big Data integration?
John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.
The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.
One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.
When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.
Q3. What are the capabilities of Hadoop beyond data storage?
John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:
– Oozie for workflow
– Pig for scripting
– Mahout or SparkML for machine learning
– Kafka and Storm for streaming
– Flume and Sqoop for integration
– Hive, Impala, Spark, and Drill for SQL analytic querying
– HBase for NoSQL
– Splice Machine for operational, transactional RDBMS
Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?
John Leach, Monte Zweben: To handle application development on Hadoop, individuals have choices to go raw Hadoop or SQL-on-Hadoop. When going the SQL route, very little new skills are required and developers can open connections to an RDBMS on Hadoop just like they used to do on Oracle, DB2, SQLServer, or Teradata. Raw HAdoop application developers should know their way around the core components of the Hadoop stack–such as HDFS, MapReduce, Kafaka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.
Q5. What are the current challenges for real-time application deployment on Hadoop?
John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.
Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.
This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.
Q6. What is special about Splice Machine auto-sharding replication and failover technology?
John Leach, Monte Zweben: As part of its automatic auto-sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.
HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.
Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?
John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.
Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.
This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.
Q8. You have recently announced partnership with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?
John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.
The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.
We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordable and at scale. Talend’s rich capabilities including drag and drop user interface, and adaptable platform allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.
And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.
Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?
John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three companies to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.
In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.
That same year, we joined Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine enables them to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.
Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.
Q10 What are the challenges of running online transaction processing on Hadoop?
John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high-levels of coordination across a cluster with too much overhead, while simultaneously providing high performance for a high concurrency of small read and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.
Splice Machine met those requirements by using distributed snap isolation, a Multi-Version Concurrency Control model that delivers lockless, and high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Lab’s OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending, distributed transactions.
John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.