Q&A with Data Engineers: Josh Poduska
Josh Poduska is a Senior Data Scientist in HPE’s Big Data Software Group. Josh has 16 years of experience in the analytical sciences, with an emphasis on machine learning and statistical applications. He has spent the last five years focusing on advanced analytical solutions with MPP columnar databases. At HPE he is part of the Vertica team and uses Vertica and its machine learning library to help organizations solve their toughest data challenges.
Q1. What are the main technical challenges you typically face in this era of Big Data Analytics?
Let me first point out that we are all lucky to live and work in a time of amazing technological advancements.
Today, fast, distributed storage and analytical compute solutions are available to almost all organizations.
As recently as the George W. Bush presidency, only a privileged few had access to this analytical firepower, resulting in huge competitive inequality in established industries and a lack of creativity in emerging sectors. The playing field is much more level today, and we are seeing the results in the disruption of industries by analytically minded organizations that know how to leverage technology. Yes, there are challenges. If you go open source, you are trading software license costs for higher headcount, and you had better hope you have a good HR team that can hire and retain the right talent.
One technical challenge these organizations will face is stitching together the various open source solutions, each of which is on a different release cadence. In my experience, the biggest open source gap right now is in efficient use of hardware.
Because each piece of the open source stack is independent, the pieces just aren’t optimized coherently.
Meeting investigative, business, discovery, deep, and predictive analytic SLAs while serving a highly concurrent user base is an infrastructure resource nightmare for today’s IT leaders. If you base your core distributed system on proprietary software, you can avoid much of the organizational cost of maintaining talented engineers and, if you hitch your wagon to the right horse, you can meet the same analytical SLAs and serve the same concurrency as a pure open source solution with about one third of the hardware. This applies on-prem and in the cloud. But be careful: you don’t want to find yourself locked into an expensive and rigid solution while your competition innovates.
Q2. How do you gain the knowledge and skills across the plethora of technologies and architectures available?
If you are an organization, you hire teams, not individuals. Don’t try to find the next Alex Rodriguez of data science.
If you find him, hiring and then keeping him is going to be very difficult. Instead, take a page from the Moneyball philosophy and hire a team of solid performers. If you are going to invest in a superstar, find the best technical leader you can and keep her happy. She can be the glue that holds the team together.
Q3. What lessons did you learn from using shared-nothing scale-out architectures, which claim to allow for support of large volumes of data?
They are the only way to go. And to add to the point, most organizations will want to avoid an appliance model. The idea is to be nimble: your demands will change. Scaling analytical performance and resource support linearly with hardware is a must-have in today’s market.
Q4. What are in your opinion the differences between capacity, scale, and performance?
I would define capacity as the ability to store massive data sets efficiently. Here we look for smart compression and perhaps modern encoding techniques. I’ve heard scale used in two ways in industry today. One is the ability to easily scale up as more storage or performance is required. The other speaks to doing analytics at scale and combines the ideas of both capacity and performance. Performance is all about meeting SLAs. Be sure to carefully define the type of analytics needed and the user base to be served before measuring performance.
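One of the "modern encoding techniques" behind the capacity wins of columnar systems is run-length encoding (RLE), which collapses runs of repeated values in a sorted, low-cardinality column. Here is a minimal Python sketch of the idea (an illustration only, not any particular database's implementation):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A sorted, low-cardinality column -- the best case for RLE.
region_column = ["east"] * 500 + ["north"] * 300 + ["west"] * 200

encoded = rle_encode(region_column)
print(encoded)             # [('east', 500), ('north', 300), ('west', 200)]
print(len(region_column))  # 1000 raw entries
print(len(encoded))        # 3 stored pairs
```

A thousand raw entries compress to three pairs; column stores exploit exactly this kind of redundancy, which row-oriented layouts cannot see because values from different columns are interleaved.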
Q5. There are several types of new data stores including NoSQL, XML-based, Document-based, Key-value, Column-family and Graph data stores. What is your experience with those?
There is a place for each of these in the analytical landscape. I often see organizations using a modern column store as their core asset, the center of their analytical strategy. The top column store solutions today can handle a variety of data types natively, with tools for schema on read, which alleviates some of the need for the other solutions. And, of course, the best column stores have the tried and true features that make them so attractive: advanced compression to minimize footprint, years of development on their query optimizers to ensure fast analytics, mature enterprise features like security and role management, very low touch from a DBA perspective, built-in machine learning libraries, mastery of relational data, and the appearance of a standard RDBMS to users and third-party tools. The other tools are great and will definitely be needed in some organizations, but they usually end up as point solutions. NoSQL and document-based systems are perhaps the exception.
They can handle structured and relational data well and are gaining mature enterprise features. However, they lack the firepower to compete with modern column stores on speed of analytics. So it often comes down to speed and advanced analytical functionality versus flexibility of data storage types. In my experience, speed and analytical prowess are in high demand today.
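The "schema on read" idea mentioned above is simple: records are stored without a schema being enforced on write, and a schema is projected onto them at query time. A minimal Python sketch, using hypothetical JSON event records to make the idea concrete:

```python
import json

# Semi-structured records: fields vary per record, nothing enforced on write.
raw_events = [
    '{"user": "ana", "action": "click", "page": "/home"}',
    '{"user": "bo", "action": "purchase", "amount": 42.5}',
    '{"user": "ana", "action": "click"}',
]

def read_with_schema(lines, schema):
    """Project each JSON record onto a schema at read time,
    filling missing fields with None (schema on read)."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in schema}

rows = list(read_with_schema(raw_events, ["user", "action", "amount"]))
for row in rows:
    print(row)
```

Each caller can pick a different schema over the same raw data, which is why this style suits variably structured feeds that a fixed relational schema would reject on load.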
Q6. How do you manage to communicate with Data Scientists on common data projects?
I advocate for the placement of Data Scientists in a centralized Center of Excellence. They can be dotted line attached to a business unit if that makes sense, but it really helps if they get together often. Data Scientists don’t like to work alone. They need to bounce ideas off each other. They are good at iterating but will get caught in an infinite loop if they only iterate with their own ideas. A strong Data Science leader will be needed to keep the team communicating effectively and help them stay current in their field.
Q7. How do you convince/explain to your managers the results you derived from your work?
As the famous statistician Dr. Deming said, “In God we trust. All others bring data.” We live in the age of data.
Do your homework and quantify the impact of your work. If you work in an organization that doesn’t value a data-driven culture, be the change you want to see. If the culture is too rigid to change, then I’d suggest dusting off the resume as your organization may not thrive long in today’s landscape.
Q8. What are the barriers to applying machine learning at scale?
There are technical and soft barriers that organizations need to consider when deciding on a strategy for applied machine learning at scale.
From a technical standpoint, you need to be careful and only consider solutions whose machine learning algorithms have been rewritten from scratch specifically for distributed and parallel systems. This is a huge barrier right off the bat, as only a handful of options meet this criterion. Legacy players have been slow to adapt. Database giants like Oracle, Microsoft, and IBM put more emphasis on making old technology run faster with infrastructure changes and code-base workarounds, while analytical giants like SAS and IBM/SPSS are just now realizing that their single-server analytics are outdated and are trying to catch up to the distributed MPP train that has already left the station. Of course, programming solutions like R and Python are not MPP either and seriously struggle to do analytics at scale.
Spark, H2O, Revolution (bought by Microsoft), Vertica Machine Learning, and MADlib (typically deployed in Greenplum) are the most popular truly parallel and distributed machine learning options I see in the market. There are others, but I will stick to these, as they are the most popular and give me a chance to compare and contrast the different approaches to machine learning at scale. I do not consider Mahout and other MapReduce-based options, as they are falling out of favor fast due to the limitations of MapReduce. Spark, H2O, and Revolution are stand-alone deployments; they need to be integrated with a storage layer like HDFS. Vertica and MADlib come deployed with an MPP column-store database, and while they can be integrated with HDFS via external tables (or installed directly on HDFS in the case of Vertica), they don’t have to be. There are pros and cons to each approach.
Spark, H2O, and Revolution offer a functional/algorithmic approach – providing distributed algorithms for ingest, processing, preparation, and machine learning tasks. Of these three, Spark has the best support for customized ingest and complex data preparation. This is in part due to the flexibility that Spark gets from its high-level language support with PySpark, Java, and Scala. Organizations take these functions and algorithms and apply them to their data store, which is typically HDFS and Hadoop, but can also be cloud stores, RDBMSs, column stores, NoSQL, and/or document stores.
H2O and Revolution require their own dedicated hardware. Spark can coexist on the Hadoop cluster but is a resource hog.
All three are memory bound, which can be a big problem in practice: OOM errors are common, users have to manage memory usage carefully, and concurrency is restricted. Multiple data scientists cannot share resources; Spark, H2O, and Revolution each require a completely contained resource instance (memory and compute) per user. This can be a real impediment to agile analytical innovation.
Vertica and Greenplum + MADlib have a similar suite of functions/algorithms for data preparation and machine learning; however, they are tightly integrated with their storage layer. This gives them better performance for most workloads and much better concurrency. It does require loading data into the database, which can be a performance disadvantage if you only intend the data to stay there for the duration of the workload; of course, it is a performance advantage if the data already lives in the database. Vertica and Greenplum use ANSI SQL and SQL-based extensions to perform streaming and batch ingest, data processing, data transformations, data organization, complex aggregations, merging of multiple datasets, and machine learning. This expands the user base that has access to machine learning tools, enabling business specialists who know SQL to leverage their domain knowledge to build predictive models without having to be computer scientists. In essence, this removes the biggest soft barrier to machine learning at scale. However, there are analytic workloads that just don’t lend themselves to SQL in a column store, like row-based processing requiring loops. Each organization must determine what makes the most sense for its structure and analytical needs.
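The in-database pattern described above can be sketched in a few lines. Here sqlite3 stands in for an MPP column store (the table and column names are invented for illustration); the point is the shape of the workflow: push preparation and aggregation into SQL so that only model-ready rows ever leave the database. Vertica and MADlib go further and run the model training itself in SQL.

```python
import sqlite3

# sqlite3 is a stand-in for an MPP column store in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INT, amount REAL, returned INT);
    INSERT INTO orders VALUES
        (1, 20.0, 0), (1, 35.0, 0), (1, 15.0, 1),
        (2, 90.0, 1), (2, 80.0, 1),
        (3, 10.0, 0);
""")

# Feature engineering entirely in SQL: one model-ready row per customer.
features = conn.execute("""
    SELECT customer_id,
           COUNT(*)    AS n_orders,
           AVG(amount) AS avg_amount,
           AVG(returned) AS return_rate
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

for row in features:
    print(row)
```

Six raw order rows become three feature rows inside the database; at MPP scale the same idea avoids shipping billions of raw rows to a separate analytics tier, which is the performance argument made above.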
Q9. Do you have experience with In-database machine learning?
I’ll add one carry-over point from the previous response. I believe there is a good chance that we will soon see in-database, massively parallel, and distributed machine learning begin to take significant market share away from Hadoop-based machine learning solutions like Spark. The population that knows SQL is much larger than that of high-level programmers.
The majority of business-critical data that needs to be modeled is still relational. MPP databases are getting better and better at processing semi-structured data, and they are an attractive solution for the centralized, single-source-of-truth store of hot data that organizations need. They inherently handle investigative and BI analytics well.
Hadoop can play the role of long term cold storage. Many organizations will find this strategy offers better analytical performance, higher analytical concurrency, a more simplified architecture, easier system management, and overall lower TCO.