From Data Wrangling to Data Harmony
- Author: Anirudh Todi
- Date: September 19, 2015
The twin phenomena of big data and machine learning are combining to give organizations previously unheard-of predictive power to drive their businesses in new ways. But behind the big data headlines that tease us with tales of amazing insight and business optimization lurks an inconvenient truth: raw data is very dirty and requires an enormous amount of effort to clean.
The hype surrounding big data masks a dirty little secret: most data sets are relatively dirty and must be thoroughly cleaned, lest the resulting analytic results be tainted and unusable. Organizations are coming to appreciate that “data wrangling” is a crucial first step in preparing data for broader analysis – and one that consumes a significant amount of time and effort. Too many people, however, regard wrangling as janitorial work, an unglamorous rite of passage before sitting down to do the “real” work. In fact, wrangling is as much a part of the data analysis process as the final results. Properly conducted, wrangling gives you insights into the nature of your data that allow you to ask better questions of it. And wrangling is not done in one fell swoop but iteratively: each pass exposes new ways the data might be “re-wrangled,” all driving toward the most robust final analysis.
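To make that iterative loop concrete, here is a minimal Python sketch. The records, field names, and date formats are hypothetical, not drawn from any particular tool; the point is the shape of the loop: each cleaning pass coerces what it can and surfaces the leftovers, which then drive the next pass.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from a messy export.
raw = [
    {"name": "  Alice ", "signup": "2015-09-19", "age": "34"},
    {"name": "BOB", "signup": "09/19/2015", "age": ""},
    {"name": "Carol", "signup": "unknown", "age": "29"},
]

def parse_date(value):
    """Try a few known formats; return None for anything unrecognized."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    return None

def wrangle_pass(records):
    """One cleaning pass: normalize strings, coerce types, flag leftovers."""
    cleaned, anomalies = [], []
    for rec in records:
        row = {
            "name": rec["name"].strip().title(),
            "signup": parse_date(rec["signup"]),
            "age": int(rec["age"]) if rec["age"].strip() else None,
        }
        # Anything we could not coerce is surfaced for the next iteration.
        if row["signup"] is None or row["age"] is None:
            anomalies.append(row)
        cleaned.append(row)
    return cleaned, anomalies

cleaned, anomalies = wrangle_pass(raw)
```

Inspecting `anomalies` after each pass – here, a missing age and an unparseable date – is exactly the step that tells you what question to ask of the data next.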
One way of looking at it is that having a big data set is like owning a big house: there are many more activities available to you and your friends, but at the end of the day, keeping the house clean and tidy is a major expense. What is absolutely essential is a way to automate the process of cleaning, normalizing, and preparing the big data set (the house) for the next analysis (the party). This data preparation is the bane of the average data scientist. Ask any data scientist how much time they spend on data prep, and they will likely answer, “Ugh, that’s the worst part.”
While big data platforms like Hadoop and NoSQL databases have given us amazing new capabilities for storing and making sense of data, those big data platforms have done little to solve the elementary data transformation issues facing us today. In some ways, Hadoop and NoSQL databases may have even exacerbated the problem because of the perception that big data has been tamed, so why not collect more of it? If anything, data scientists and data analysts are struggling even more under the weight of big data and the demands to crunch it, crack it, and get something useful out of it.
More and better automation tools such as machine-learning technologies are needed to free data scientists from mundane “data-wrangling” chores. Those tools would allow scientists to focus on gleaning insights from prepared data.
In a recent survey of the state of big data, a range of experts told the New York Times that data scientists spend from 50 percent to 80 percent of their time organizing data, performing “data janitor work,” before they can begin sifting through it for nuggets.
Luckily, the big data community has responded with new offerings that take data transformation to new automated heights. Necessity is the mother of invention, which is why some smart folks at startups like Trifacta, Tamr, and Paxata – not to mention established companies like Informatica, IBM, and Progress Software – are turning this need into a winning business model.
These companies are on the cutting edge of data transformation. They were founded with the principle that there has to be a better way to clean, normalize, and otherwise prepare data for analytics. They are exploring innovative use of visual displays and machine learning algorithms to help automate the data preparation process. All these automation efforts are aimed at making it easier to prepare and harmonize data, thereby speeding up data analysis.
Although a machine can’t solve 100 percent of the data wrangling and blending problem – data is too unpredictable and complex – solving as much of it as possible through automation is a big leap forward. To start, automation may be most accurate on structured and semi-structured data, while ad hoc, free-form, user-generated data may require human intervention to shape it into a machine-readable form. This is not to say that unstructured data won’t lend itself to more automation over time, but in the near term it presents the greatest challenge for accurate machine readability. Ultimately, a machine with semantic recognition capabilities could learn patterns with a high degree of accuracy as more data is ingested and processed, whether that data is structured, semi-structured, or unstructured.
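A toy illustration of why structured values automate well: even a few hand-written patterns can recognize the semantics of a column, while free-form text falls through to a human. This is a minimal sketch with made-up patterns and type names, not how any of the vendors above actually implement semantic recognition.

```python
import re

def infer_type(values):
    """Guess a column's semantic type from simple value patterns.

    Returns the first pattern that matches every value, or "text"
    as a fallback for free-form content that likely needs a human.
    """
    patterns = {
        "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
        "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "number": re.compile(r"^-?\d+(\.\d+)?$"),
    }
    for name, pattern in patterns.items():
        if all(pattern.match(v) for v in values):
            return name
    return "text"
```

A real system would learn such patterns from ingested data rather than hard-code them, which is the “semantic recognition” promise: accuracy improves as more columns are seen.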
Data wrangling is not a new problem. It has plagued companies of all kinds for decades. But as data explodes in volume and diversity, the need for machines to automate data wrangling becomes even more urgent. Machines can go a long way in automating what has been the most painful and costly part of harnessing big data. New types of data continue to be generated from newly invented sources, creating an ever-moving finish line that makes 100 percent automation increasingly difficult. While we may never achieve complete automation, there is a clear need to invest heavily in the problem of data cleansing and transformation with the power of advanced machine learning algorithms.