Mike Shumpert is a Vice President of Analytics at Software AG. He has 15 years of experience in data science and machine learning across a broad range of industries and use cases. Mike received a Bachelor of Science degree in Systems Engineering with minors in Operations Research and Electrical Engineering from the University of Virginia and a Master of Business Administration degree from Georgetown University.
Q1. Is domain knowledge necessary for a data scientist?
For junior data scientists just starting out, domain knowledge is not that critical. They first need to learn and apply the fundamentals such as best practice pipelines/workflows, machine learning algorithms with suitability for different problem types, and as many tools as possible, including those for data extraction, preparation, and transformation.
However, the biggest challenge facing data scientists is the ability to explain their results. This is critical for funding projects, whether in academia, government or industry.
If the business sponsors do not understand or believe your approach, then your best ideas may never get off the ground.
So more senior data scientists – those commanding the highest salaries – have generally worked deeply in one or more domains. Their understanding of the industry and use cases not only helps in feature engineering, but their ability to talk the language of the business gives them an edge in project approvals and budgets.
While a certain depth of domain knowledge is required to gain these advantages, breadth across several industries is also highly prized and can be more satisfying in the long run.
Q2. What should every data scientist know about machine learning?
Some data scientists get caught up in using as many algorithms as they can and prioritize new ones over ones that have been around a while. Certainly, advances continue to be made, but most new algorithms and techniques are being introduced to improve performance for specific use cases that have been underserved by other algorithms. As such, they might not generalize to other use cases very well, if at all. Understanding the type of use cases an algorithm is best suited for is crucial for efficient data science projects.
Also, it is very helpful to understand why a given algorithm is best suited for particular problems instead of blindly varying different parameters in an attempt to increase performance.
I highly recommend hand-implementing the basic algorithms in the language of your choice to understand how they work. This is not meant as a replacement of off-the-shelf implementations – they will still be more efficient and offer more options for real world problems – but the insight gained from the hand implementation will allow you to better understand when to use the algorithm and how to tune it.
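As one illustration of what such a hand implementation might look like, here is a minimal sketch of k-nearest-neighbors classification in plain Python. The choice of algorithm, the function name knn_predict, and the sample points are assumptions made for this example, not something prescribed in the interview.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, (1.1, 1.0))
```

Writing even this much by hand makes the tuning questions concrete: what k to use, which distance measure fits the data, and why unscaled features can dominate the vote.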
Q3. What are the most effective machine learning algorithms?
As stated above, most algorithms were developed in response to some specific challenge with existing algorithms for a given use case. So their optimization for that challenge means they may not be as well suited for other problems. Of course, there are some algorithms that generalize better than others, but the best results will always depend on the type of problem and the data available.
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
It might sound obvious that the starting point for all machine learning is the data itself, but too many data scientists jump into trying out various algorithms without really considering what the data can tell them. There are important questions one has to ask at the very beginning: How much data do I have (depth)? How many features do I have (breadth)?
How much data is missing (and do I care)? Are there physical properties or heuristics associated with the underlying problem that can inform what new features might make sense?
Can I derive new features from existing ones or are there other sources for additional ones?
It might be counterintuitive that too many features can be as big an issue as too few. Regularization techniques have emerged to handle the former, while considerable creativity is often required for the latter.
With respect to the depth and sparsity of the data, someone once said, “You can tell a lot about a data scientist from the way they treat missing data.” Those who come from a background of statistics will typically want to use various methods to impute the missing values, whereas those coming from a “big data” perspective will just throw out the entire instance if one feature is missing. And bear in mind that just because you have a lot of data does not mean it is representative of the problem you are trying to predict. A data set with a million rows capturing the behavior of one specific machine breakdown is not going to tell you how to predict all failures for that equipment.
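The two attitudes toward missing data can be contrasted in a few lines. This is a minimal sketch with made-up sensor rows; mean imputation is used here only as the simplest stand-in for the "various methods" a statistician might choose, and the function names are hypothetical.

```python
from statistics import mean

# Made-up rows; None marks a missing value.
rows = [
    {"temp": 70.0, "pressure": 30.0},
    {"temp": None, "pressure": 32.0},
    {"temp": 74.0, "pressure": None},
    {"temp": 72.0, "pressure": 34.0},
]


def drop_incomplete(rows):
    """'Big data' style: discard any instance with a missing feature."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Statistics style: fill each gap with the mean of the observed values."""
    features = list(rows[0].keys())
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in features}
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in features}
        for r in rows
    ]


dropped = drop_incomplete(rows)   # keeps only the fully observed rows
imputed = impute_mean(rows)       # keeps every row, with gaps filled
```

Dropping halves this tiny data set, while imputing keeps every row at the cost of inventing values. Which trade-off is acceptable depends on how much data there is and why it is missing.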
Q5. How do you ensure data quality?
On the one hand, one of the basic tenets of “big data” is that you can’t ensure data quality – today’s data is voluminous and messy, and you’d better be prepared to deal with it. As mentioned before, “dealing with it” can simply mean throwing some instances out, but sometimes what you think is an outlier could be the most important information you have.
So if you want to enforce at least some data quality, what can you do? It’s useful to think of data as comprising two main types: transactional and reference. Transactional data is time-based and constantly changing – it typically conveys that something just happened (e.g., customer checkouts), although it can also be continuous data sampled at regular intervals (e.g., sensor data). Reference data changes very slowly and can be thought of as the properties of the object (customer, machine, etc.) at the center of the prediction.
Both types of data typically have predictive value: this amount at this location was just spent (transactional) by a platinum-level female customer (reference) – is it fraud? But the two types often come from different sources and can be treated differently in terms of data quality.
Transactional data can be filtered or smoothed to remove transitory outliers, but the problem domain will determine whether or not any such anomalies are noise or real (and thus very important). For example, the $10,000 purchase on a credit card with a typical maximum of $500 is one that deserves further scrutiny, not dismissal.
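One way to act on this is to flag anomalies for review instead of silently filtering them out. The sketch below uses the median absolute deviation as the outlier test; that choice, the threshold of 5, and the purchase amounts are all assumptions made for illustration.

```python
from statistics import median


def flag_outliers(amounts, threshold=5.0):
    """Return (amount, is_outlier) pairs based on the median absolute deviation."""
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        # All typical values are identical; anything different stands out.
        return [(a, a != center) for a in amounts]
    return [(a, abs(a - center) / mad > threshold) for a in amounts]


# Made-up card purchases, including one far above the usual range.
purchases = [120, 80, 95, 150, 60, 10000, 110]

# Flagged amounts are kept and routed for scrutiny, not discarded.
flagged = [amount for amount, is_outlier in flag_outliers(purchases) if is_outlier]
```

The same test could drive a smoothing filter for sensor noise. The difference lies only in what happens to the flagged values, and that decision belongs to the problem domain.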
But reference data can be separately cleansed and maintained via Master Data Management (MDM) technology. This ensures there is only one version of the truth with respect to the object at the core of the prediction and prevents nonsensical changes such as a customer moving from gold status to platinum and back again within 30 seconds. Clean reference data can then be merged with transactional data on the fly to ensure accurate predictions.
Using an Internet of Things (IoT) example, consider a predictive model for determining when a machine needs to be serviced. The model will want to leverage all the sensor data available, but it will also likely find useful factors such as the machine type, date of last service, country of origin, etc. The data stream coming from the sensors usually will not carry with it the reference data and will probably only provide a sensor id. That id can be used to look up relevant machine data and enrich the data stream on the fly with all the features needed for the prediction.
One final point on this setup is that you do not want to go back to the original data sources of record for this on-the-fly enrichment of transactional data with reference data.
You want the cleansed data from the MDM system, and you want that stored in memory for high-performance retrieval.
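The enrichment step can be sketched as a simple in-memory lookup keyed by sensor id. The dictionary below stands in for cleansed reference data loaded from an MDM system; the ids, field names, and values are hypothetical, and a production setup would more likely use a dedicated in-memory store than a Python dict.

```python
# Hypothetical cleansed reference data, held in memory for fast retrieval.
MACHINE_REFERENCE = {
    "sensor-17": {"machine_type": "pump", "last_service": "2023-11-02", "country_of_origin": "DE"},
    "sensor-42": {"machine_type": "compressor", "last_service": "2024-01-15", "country_of_origin": "US"},
}


def enrich(reading, reference=MACHINE_REFERENCE):
    """Merge a raw sensor reading with the reference data for its sensor id."""
    machine = reference.get(reading["sensor_id"])
    if machine is None:
        # Unknown sensors are surfaced rather than scored with partial features.
        raise KeyError(f"no reference data for {reading['sensor_id']}")
    return {**reading, **machine}


# A raw reading carries only the sensor id and the measurement.
raw = {"sensor_id": "sensor-17", "vibration": 0.42}
features = enrich(raw)
```

After enrichment, the record holds both the transactional measurement and the slowly changing machine properties, which is the full feature set the model expects.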
Q6. What are the typical mistakes made when analyzing data for a large-scale data project? Can they be avoided in practice?
When faced with a large-scale data project, many data scientists don’t think all the way through to the end game of how predictions will be executed. Generally speaking, a large amount of historical data for training a model implies that there will be a fair amount of data when executing the model. So one needs to anticipate how that data will be integrated and transformed on the fly to match the exact steps taken with the training data. You need to understand the throughput requirements for the end application and processing capabilities of the infrastructure – if your elegant ensemble model performs with high accuracy in the lab but is too slow in production, you will need to start over.
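Matching the training-time transformations at execution time usually means fitting the transformation parameters once, persisting them, and reusing them unchanged in production. Here is a minimal sketch using standardization as the example transformation; the function names, the JSON hand-off, and the sample values are assumptions for illustration.

```python
import json
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn standardization parameters from the training data only."""
    std = pstdev(values)
    return {"mean": mean(values), "std": std if std > 0 else 1.0}


def transform(value, params):
    """Apply the stored parameters; never re-fit on production data."""
    return (value - params["mean"]) / params["std"]


# Training time: fit the parameters and persist them alongside the model.
training_values = [10.0, 20.0, 30.0, 40.0, 50.0]
params = fit_scaler(training_values)
saved = json.dumps(params)

# Execution time: load the same parameters and transform each incoming value.
production_params = json.loads(saved)
scored_input = transform(50.0, production_params)
```

If production recomputed the mean and standard deviation from its own data, the model would see inputs on a different scale than it was trained on, and its accuracy in the lab would say little about its accuracy in use.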
Also, you need to understand how often the model is likely to need to be updated. Data scientists typically develop models using one set of tools (R, Python, etc.) and then hand off the model to an entirely different IT group to implement the model in production using another set of tools (Java, C, etc.). Because the two groups speak different languages, it is common for them to iterate between requirements and development several times over before getting it right. And if the model has only just been promoted to production when it’s time to start over again, this can be a big drag on performance and accuracy: your model is always out of date by the time it is implemented.
The best way to avoid this is to use a standards-based approach that allows model development to be insulated from model implementation. The Predictive Model Markup Language (PMML) is the de facto industry standard for expressing all the major algorithms in a standard way and can be exported by all of the leading commercial and open source data mining platforms.
IT can then use a PMML execution engine to import the models for predictions with no implementation required. This approach can reduce the model development cycle from weeks to days or even hours while providing the high performance required for the most demanding of user applications.
Another advantage of the PMML approach is that the decoupling of the modeling tools from the implementation means that data scientists are free to use their favorite tools or try new ones without worrying about the impact on the data project schedule.
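To make the hand-off concrete, the sketch below embeds a small hand-written PMML regression model and reads it with Python's standard XML parser. It is a toy reader, not a PMML execution engine: it handles only a linear RegressionTable with NumericPredictor terms, ignores exponents and everything else in the specification, and the field names and coefficients are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A hand-written, illustrative PMML document (coefficients are made up).
PMML_DOC = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="vibration" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="vibration"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="temperature" coefficient="0.25"/>
      <NumericPredictor name="vibration" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def load_linear_model(pmml_text):
    """Extract the intercept and coefficients from a linear RegressionTable."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, features):
    """Score one record: intercept plus the weighted sum of its features."""
    intercept, coefficients = model
    return intercept + sum(c * features[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
score = predict(model, {"temperature": 80.0, "vibration": 0.5})
```

The scoring side never needs to know which tool trained the model. Replacing the model means replacing the document, with no change to the code that reads it.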