On HPCC Systems Platform. Q&A with Richard Taylor
Q1. What is the difference between a “Data Lake” and “Data Warehouse”?
A Data Lake contains raw data in its original form. That raw data is Extracted from the lake, Loaded into working systems, and Transformed for use by each specific operation (generally data analytics). The advantage of keeping the data in its raw form is that additional data can continually be added without affecting the code required to perform operations on that data. That makes a Data Lake best for use by Data Scientists who are exploring data to find hidden relationships.
A Data Warehouse contains structured data. That data is the end result of the standard ETL (Extract, Transform, Load) process used to pre-process data for use in data products available to end-users. That makes a Data Warehouse best for use in commercial data products for end-user access.
The HPCC Systems platform can be used for either or both — Data Lake and/or Data Warehouse.
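The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration, not HPCC Systems or ECL code: the record layout, the standardization rules, and the names `transform`, `build_warehouse` and `build_lake` are all assumptions made up for the example.

```python
# Hypothetical raw records, as they might arrive from a source system.
RAW_RECORDS = [
    {"name": " alice SMITH ", "city": "boca raton"},
    {"name": "Bob Jones", "city": " ATLANTA"},
]

def transform(record):
    """Standardize one record (illustrative rules only)."""
    return {
        "name": record["name"].strip().title(),
        "city": record["city"].strip().title(),
    }

def build_warehouse(raw):
    """ETL: transform first, then load only the structured result."""
    return [transform(r) for r in raw]

def build_lake(raw):
    """ELT: load the raw records unchanged; transform later, per use."""
    return [dict(r) for r in raw]

warehouse = build_warehouse(RAW_RECORDS)
lake = build_lake(RAW_RECORDS)

# A lake consumer applies whatever transform it needs at read time.
analysis_view = [transform(r) for r in lake]
```

In this sketch the warehouse holds only the cleaned form, while the lake keeps the original values, so a different consumer could later apply a different transform to the same raw data.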
Q2. What is HPCC Systems?
HPCC Systems is a massively parallel Big Data processing platform whose development started in 1999 as an internal, proprietary tool. It was designed to solve the kind of Big Data problems that we were encountering on a daily basis. It was originally designed to run on commercial off-the-shelf computers in our LexisNexis® Risk Solutions facilities (a bare-metal configuration), but it now supports both bare-metal and cloud-native configurations (the latter using Kubernetes, so it is agnostic to whose cloud it runs on: Amazon, Azure, etc.).
Q3. In 2011, LexisNexis Risk Solutions decided to release HPCC Systems under an open source license. Why?
Hadoop had become the de facto industry standard tool for Big Data processing since its Open Source release in 2008. LexisNexis Risk Solutions felt that making its HPCC Systems platform Open Source would help to maintain its relevance in the industry.
The HPCC Systems platform was originally developed by a single team as an all-inclusive, end-to-end platform to encompass the entire data process: from data ingest, through ETL work and final product data design, all the way through to data delivery to end-users. That makes it a much more comprehensive big data solution than other open source options (to this day), because you don’t need to add any third-party components to create a full production installation.
Q4. Once data is added to the data lake, the process of data enrichment begins. What does it mean in practice?
It means pre-processing the data to create the final data product. LexisNexis® Risk Solutions does not sell “data” (anybody can acquire the public record data we start with for many of our products) — we create data products that provide the “information” end users are actually looking for (the result of the hard work of distilling relevant information from raw input data).
Q5. Please describe the HPCC Systems Data Enrichment Pipeline.
1. Adding your own unique identifier to every new record coming in.
2. Cleaning and standardization.
3. Linking disparate data sources to add additional data points and attributes.
4. Tracking changes to data over time.
5. Creating the final product datasets to service end-user queries.
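The five steps above can be sketched as a small Python pipeline. This is a toy illustration under stated assumptions, not how the HPCC Systems platform implements them: every function name, the record fields (`name`, `city`, `rec_id`) and the in-memory `history` dictionary are hypothetical.

```python
import itertools

# Hypothetical ID source; a real system would use a durable sequence.
_id_counter = itertools.count(1)

def assign_ids(records):
    """Step 1: give every incoming record its own unique identifier."""
    return [{**r, "rec_id": next(_id_counter)} for r in records]

def clean(records):
    """Step 2: clean and standardize fields (here: trim and upper-case names)."""
    return [{**r, "name": r["name"].strip().upper()} for r in records]

def link(records, reference):
    """Step 3: link to another source to add an attribute (here: city)."""
    return [{**r, "city": reference.get(r["name"])} for r in records]

def track_changes(history, records, as_of):
    """Step 4: keep each distinct version of a record and when it was first seen."""
    for r in records:
        versions = history.setdefault(r["name"], [])
        if not versions or versions[-1]["city"] != r["city"]:
            versions.append({"city": r["city"], "first_seen": as_of})
    return history

def build_product(history):
    """Step 5: build the dataset that serves queries (latest city per name)."""
    return {name: versions[-1]["city"] for name, versions in history.items()}

# Two incoming batches for the same person, with a changed city.
history = {}
batch_1 = link(clean(assign_ids([{"name": " alice smith "}])),
               {"ALICE SMITH": "Atlanta"})
track_changes(history, batch_1, "2020-01-01")
batch_2 = link(clean(assign_ids([{"name": "Alice Smith"}])),
               {"ALICE SMITH": "Boca Raton"})
track_changes(history, batch_2, "2021-01-01")
product = build_product(history)
```

Here `history` retains both versions of the record, while `product` exposes only the current one for end-user lookups.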
Q6. What are the main components of an HPCC Systems Data Lake? And what are they useful for?
The HPCC Systems platform has a number of infrastructure elements that work with and manage two primary cluster types.
The major infrastructure elements are:
1. Dali — the system data store, keeping track of everything.
2. DFU — Distributed File Utility, handling all aspects of working with distributed datasets so programmers don’t have to.
3. ESP — Enterprise Service Platform, configurable communication layer to handle security, logging, etc.
And the two primary cluster types are:
Thor — the developer’s tool to handle pre-processing all data.
Roxie — the customer-facing tool to handle millions of concurrent queries from end-users.
Q7. Is it possible to integrate the ECL (Enterprise Control Language) declarative programming language with standard SQL to query and update the data lake?
The short answer is: yes, you can embed SQL code within your ECL code, combining both into a single query. But a better question might be: Why did we create ECL instead of just using SQL?
The answer to that question goes back to the very beginning of the HPCC Systems platform. It was the fourth quarter of 2000 and we already had a massively parallel computing cluster platform in production that was the precursor to Thor and Roxie, but it used SQL as its query language. A customer came to us with a problem that they wanted us to solve for them, but the sheer complexity of the problem precluded using SQL. So our architects designed a more terse but expressive query language (the beginnings of ECL), and in 90 days we had solved the problem by writing about a thousand ECL definitions to define the result. On that precursor platform, using our brand new query language, the job that took 28 hours on the customer’s IBM mainframe ran in about 6.5 minutes. The first ECL “compiler” just translated those thousand definitions into SQL to run, and when we looked at the generated SQL it came to over a million lines. This massive increase in programmer efficiency is why we created ECL instead of just using SQL.
Q8. Who are the main current HPCC Systems users and what do they use HPCC Systems for?
Because the HPCC Systems platform is open source and nobody using it is required to contact us, there is no definitive way to answer that question. The HPCC Systems platform is the foundation on which almost all LexisNexis Risk Solutions data products worldwide are built. The LexisNexis Legal and Professional business also uses the HPCC Systems platform in its data products, as do other areas within the RELX Group. Several agencies of the US government and a number of other companies also use the platform. While I am not at liberty to share names of individual companies, several examples of customer case studies can be found on our website.
Q9. Do you offer training and support options to customers interested in applying HPCC Systems to their own data management and analytics needs? If yes, which ones? And where can they find relevant resources?
Yes! We do offer free online training courses to anybody that wants to learn, along with remote and on-site training. This page will get you started.
In addition to our training, there is also the “Mastering HPCC Systems” series of books available on Amazon, Google Play, and Google Books.
All three books are available at Amazon.
Each individual book is available at Google Books:
Mastering HPCC Systems: Platform Overview and History
Mastering HPCC Systems: Fundamentals of ETL Processing
Mastering HPCC Systems: ECL Cookbook
Q10. How do you get in contact with the ECL developer community that contributes code to the HPCC Systems platform?
The community page will put you in touch with every aspect of our user community.
……………………………………………………………
Richard Taylor has been Chief Trainer for HPCC Systems since its inception, and was the original author of all the ECL language documentation. His most recent publications are the books of the “Mastering HPCC Systems” series, now available for free download from Amazon, Google Play, and Google Books.
Sponsored by HPCC Systems from LexisNexis Risk Solutions