Brett Wujek, Ph.D. is a Senior Data Scientist with the R&D team in the SAS Advanced Analytics division. He helps evangelize and guide the direction of advanced analytics development at SAS, particularly in the areas of machine learning and data mining. Prior to joining SAS, Brett led the development of process integration and design exploration technologies at Dassault Systèmes, helping to architect and implement industry-leading computer-aided optimization software for product design applications. His formal background is in design optimization methodologies, helping establish the Design Automation Lab at the University of Notre Dame, where he received his Ph.D. for his work developing efficient algorithms for multidisciplinary design optimization.
Q1. Is domain knowledge necessary for a data scientist?
In any data science or machine learning application, a good understanding of the business problem at hand is certainly very important – you need to have a good sense of what you are hoping to achieve and what attributes you have available to find patterns or make predictions toward that goal, and you need to be able to judge whether what you are seeing from your model(s) makes any sense. But let’s face it, with sufficient historical data a good data scientist can build a pretty good model for insurance fraud or financial risk without knowing much about the insurance or financial industries. In fact, I’d argue that sometimes domain knowledge can make you a little lazy in the very important feature engineering phase if you aren’t careful. However, applying the models, making sound business decisions from what they are telling you, and understanding when they have become stale all require someone with a reasonable amount of domain knowledge. Ultimately, data science is not practiced in isolation – a small team of people with an assortment of skills working with someone who has adequate domain knowledge for a given project can achieve great results.
Q2. What should every data scientist know about machine learning?
First, I’d say you should know some history. Machine learning is not new; realize, for example, that Arthur Samuel was formulating self-learning programs back in the 1950s, and neural networks evolved over decades in the mid- to late twentieth century. Knowing when and how major concepts like boosting came about will give you some good perspective on the latest advances in machine learning, which are typically some variant of prior formulations. And be sure to arm yourself with a good understanding of the distinction, interplay, and overlap among various disciplines (statistical modeling, data mining, machine learning, etc.). Not that I’m saying there are clear-cut answers – there are no definitive boundaries (nor should there be) – but to me it largely comes down to the assumptions made (or ignored) in building the models and the ultimate application of the algorithms. In machine learning applications, we focus mostly on accuracy over interpretability or human comprehension of how the model is providing insight/predictions, as the machine itself is the ultimate consumer. But honestly, you can justifiably spin this in a number of different ways – just have your own good answer for it.
Then, be disciplined enough to build a good base of knowledge with the fundamentals before playing with the new shiny toys. Understand the concepts of feature engineering, the perils of data leakage corrupting honest model assessment, and the nature of the training objective, including balancing accuracy and complexity for better model generalization. Honestly, this is fairly standard predictive modeling practice. I won’t put a plug in here for any single reference, but you can find the most popular ones with some simple internet searches – take the time to read and let it sink in.
Finally, keep up on the latest methods because they are evolving very quickly, but don’t be afraid to start simple when first attacking a new problem. Simple linear/logistic regression or decision trees can at least give you some insight into the nature of the feature space, if not provide an adequate model to serve as a baseline.
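To make the "start simple" advice concrete, here is a minimal sketch in plain Python. It compares a majority-class baseline with a one-feature decision stump. The toy data and the function names (`majority_baseline`, `fit_stump`, `accuracy`) are hypothetical and chosen only for illustration, and the accuracies are measured on the training data itself, so they say nothing about generalization.

```python
from collections import Counter


def accuracy(preds, ys):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def majority_baseline(labels):
    """Return the most common label; predicting it for every row is the simplest baseline."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Fit a one-feature decision stump that predicts 1 when x > threshold, else 0.

    Tries each midpoint between adjacent distinct feature values and keeps
    the first threshold with the highest training accuracy.
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_acc = None, -1.0
    for t in candidates:
        acc = accuracy([int(x > t) for x in xs], ys)
        if acc > best_acc:
            best_threshold, best_acc = t, acc
    return best_threshold


# Hypothetical toy data: one numeric feature and a binary label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1]

majority = majority_baseline(ys)
baseline_acc = accuracy([majority] * len(ys), ys)

threshold = fit_stump(xs, ys)
stump_acc = accuracy([int(x > threshold) for x in xs], ys)

print("majority baseline accuracy:", baseline_acc)
print("stump threshold:", threshold, "accuracy:", stump_acc)
```

Any more elaborate model would then have to beat both numbers on held-out data to justify its added complexity.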
Q3. What are the most effective machine learning algorithms?
This is like asking “what’s the most effective tool in your toolbox?” My hammer is not going to be much help cutting a piece of wood; my handsaw will certainly work and be appropriate for small pieces of wood; but my power circular saw will be most effective for bigger cuts. Like I said earlier, the simpler the better, as long as it provides the level of results (generalized accuracy or amount of insight gained) that you are striving for.
That being said, it’s pretty clear from Kaggle competitions that when going for the most accurate model possible, ensemble models are the way to go. In particular, forms of gradient boosting, such as extreme gradient boosting, tend to do very well, especially when they are then further combined with other models using stacked ensembling techniques. This just makes sense – it follows the “wisdom of the crowd” philosophy as Aristotle put it long ago. Some really smart people are finding novel and effective ways of combining existing tried-and-true techniques. For more focused applications such as image recognition and other cognitive computing tasks, deep convolutional neural networks have been shown to be very effective. Factorization machines have been shown to be quite capable for sparse data like product or movie recommendations.
Overall, the machine learning toolbox is ever-expanding, and distributed/parallel analytics computing platforms are making this much more feasible. But again, you don’t want to use a power saw to cut an apple. Start simple and work your way up.
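The “wisdom of the crowd” intuition behind ensembles can be illustrated with a small calculation. The sketch below is an idealized illustration, not a description of how gradient boosting or stacking works: it assumes n fully independent classifiers, each correct with the same probability p, combined by simple majority vote. Real ensemble members are correlated, so actual gains are smaller. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters is correct.

    Each voter is assumed to be correct with probability p, independently
    of the others. n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so there are no ties")
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Each individual model is only modestly better than a coin flip.
for n in (1, 3, 5, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 4))
```

Under these assumptions, three voters at 60% individual accuracy reach 64.8% as a group, and the figure keeps climbing as more independent voters are added.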
Q4. Can data ingestion be automated?
Can it be automated? Yes. Should it be automated? It depends. It all comes down to how much you trust the accuracy and stability of the data source. In today’s IoT applications, automated data ingestion is what it’s all about – event stream processing of data from sensors in real time (or near real time) to monitor and make immediate decisions. Mind you, this is typically in the sense of scoring new observations (making predictions from existing models), not training models.
But even for training models, there are online machine learning algorithms that can adapt/update models as new data is made available. The issues here, though, are that (1) it takes a fair amount of IT infrastructure to manage the feedback loop for updating the model, (2) you still have the trust issue with regard to potentially corrupting the model with unrepresentative data, (3) lots of data is not always a good thing (if left unfiltered), and (4) you still have to monitor the model to understand when it may be more appropriate to discard old data and completely retrain.
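As a rough illustration of what “adapt/update models as new data is made available” can look like, here is a minimal sketch of online logistic regression trained by stochastic gradient descent, one observation at a time. The class name `OnlineLogisticRegression`, the learning rate, and the simulated stream are all assumptions made for this example. It does no filtering, monitoring, or retraining, which are exactly the concerns listed above.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression whose weights are updated one observation at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Return the predicted probability that the label is 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take one gradient step on the log loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Hypothetical stream: small feature values are labeled 0, large ones 1.
stream = [([0.0], 0), ([1.0], 1), ([0.2], 0), ([0.9], 1)] * 200

model = OnlineLogisticRegression(n_features=1)
for x, y in stream:
    model.update(x, y)  # the model changes as each new observation arrives

print("P(y=1 | x=0.0):", round(model.predict_proba([0.0]), 3))
print("P(y=1 | x=1.0):", round(model.predict_proba([1.0]), 3))
```

Because every incoming observation moves the weights, a run of unrepresentative data would shift the model just as readily as good data, which is why an update loop like this still needs to be watched.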
So I do think the specific process of data ingestion can be automated, and for scoring that may be appropriate (or even necessary) as long as the model health is monitored…but that does not mean it can be blindly fed into your modeling process unattended.
Q5. What are the ethical issues that a data scientist must always consider?
To me, this is one of the most thought-provoking questions we can, and must, continue to ask ourselves. The main ethical considerations are issues of data privacy and fair and appropriate application of the models. Because we typically cut our teeth on canned (often artificially generated) data sets, we don’t tend to consider where the data actually came from or what it represents. You have to develop the discipline to be aware that the data often is some representation of the actions, behaviors, and lives of real people, and to continually question whether you really should be exposed to all of that. The major data collection corporations take this responsibility very seriously and keep data aggregated to avoid uncovering and analyzing any specific individual data that can be personally identifiable. We as individual data scientists really need to have this same level of responsibility as we search for, scrape, and extract data from various sources.
In terms of application of the models, we always have to realize that the attributes/features represented in the data could indirectly bias decisions that are made using insight or predictions from the models. And more and more, organizations are using systems backed by these models to automate decisions related to how they offer their services. It’s very difficult to uncover the bias once it’s embedded in the model, and in some cases a feature that might be considered a discriminating factor may actually be a rightful use of distinguishing information. I don’t know that there is a clear-cut solution other than just ensuring we have an awareness that there may be ethical questions to consider in any project, and educating ourselves to provide a solid foundation of ethics in machine learning so that we avoid reinforcing and proliferating biases.