Romeo Kienzler works as Chief Data Scientist in the IBM Watson IoT World Wide team, helping clients apply advanced machine learning at scale to their IoT sensor data. His current research focus is scalable machine learning on Apache Spark. He is a contributor to various open source projects and works as an associate professor for data mining at a Swiss university. Romeo Kienzler is a member of the IBM Technical Expert Council and the IBM Academy of Technology, IBM’s leading brain trusts.
Q1. Is domain knowledge necessary for a data scientist?
Absolutely! The main reason is that without domain knowledge you have no idea about the quality of the data you are analysing. On paper everything might look fine, but to a domain expert your findings won’t make any sense.
This is the less disastrous case. It becomes worse if the domain expert can’t spot the mistakes the data scientist is making. Let me give you an example. I’m a bioinformatician and senior in applied statistics, but I recently started working as Chief Data Scientist for IBM Watson IoT, where my main focus is on real-time analysis of sensor data streams.
But sensors tend to create very noisy data, and you have to understand exactly what the pipeline looks like, from the analog sensors in the field (and what they are actually measuring) to additions to the signal caused by interference, temperature changes and instabilities in the power supply. E.g. in my last project I learned that the sampling rate of my vibration sensor was too low and that I had lost part of the high-frequency spectrum (Nyquist-Shannon sampling theorem). That’s my way of becoming a domain expert: talk to the domain experts in your project and make sure you clearly understand what they are talking about.
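The sampling-rate problem mentioned above can be illustrated with a small sketch. The numbers are hypothetical (a 900 Hz vibration component sampled at 1000 Hz) and are not taken from the project described; the function names are made up for this example. It shows that a component above half the sampling rate is not simply lost but shows up as a lower frequency.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = k / sample_rate for k = 0 .. n_samples - 1."""
    return [math.sin(2 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(n_samples)]

def nyquist_limit(sample_rate_hz):
    """Highest frequency that can be represented without aliasing."""
    return sample_rate_hz / 2.0

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency after sampling, folded into [0, sample_rate / 2]."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

sample_rate = 1000.0   # hypothetical sensor sampling rate in Hz
true_freq = 900.0      # hypothetical vibration component in Hz

limit = nyquist_limit(sample_rate)                  # 500 Hz
apparent = alias_frequency(true_freq, sample_rate)  # 900 Hz folds down to 100 Hz

high = sample_sine(true_freq, sample_rate, 50)
low = sample_sine(apparent, sample_rate, 50)

# At these sample points the 900 Hz sine equals the negated 100 Hz sine,
# so the two signals cannot be told apart from the samples alone.
max_diff = max(abs(h + l) for h, l in zip(high, low))
print(limit, apparent, max_diff)
```

In practice this is why the sampling rate has to be chosen with the highest frequency of interest in mind, something the domain expert usually knows.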
Q2. What should every data scientist know about machine learning?
I can’t speak for everyone, but what helped me very much was taking the “Machine Learning” course from Stanford on Coursera with Andrew Ng. This changed my life. So, to answer your question: every data scientist should have implemented at least one algorithm from the following classes on their own (without using a library or tool): clustering, classification, regression.
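As an illustration of what "implementing it on your own" can look like, here is a minimal sketch of k-means clustering (Lloyd's algorithm) in plain Python. The data points and starting centroids are made up for the example, and this is only one of many algorithms that would fit the exercise.

```python
def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move every centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster

        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
final_centroids, final_clusters = kmeans(points, [(1.0, 1.0), (9.0, 9.0)])
print(final_centroids)
```

Writing the assignment and update steps by hand makes it obvious why the result depends on the starting centroids, which is easy to overlook when calling a library.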
Q3. What are the most effective machine learning algorithms?
In terms of predictive power: neural networks and gradient boosted trees. For a real-world application I’d go for linear-regression-like models. They are effective because they are fast, rarely over-fit and explain nicely how predictors and target correlate. Always start simple and scale up in iterations.
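A minimal sketch of such a "start simple" baseline is an ordinary least squares fit with a single predictor. The readings below are invented for illustration; the point is that the fitted slope and intercept can be read directly as "how the target changes per unit of the predictor".

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical readings: predictor x and target y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"y is roughly {intercept:.2f} + {slope:.2f} * x")
```

A model like this gives a reference score that any more complex model has to beat before its extra cost is justified.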
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Before the deep learning era I’d have said: with a lot of experience and knowledge about the domain. But given enough data, an LSTM can basically learn (and outperform) any manual feature engineering. So in case you have a lot of data (we are talking about n > 10 x degree_of_freedom(model)), give LSTMs a shot!
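The rule of thumb above can be turned into a quick check. This sketch assumes that the degrees of freedom of the model can be approximated by its number of trainable parameters, which is an interpretation and not something stated in the interview. The parameter count uses the textbook single-layer LSTM with one bias vector per gate; some frameworks count biases differently. Both function names are hypothetical.

```python
def lstm_parameter_count(input_dim, hidden_units):
    """Trainable parameters of one LSTM layer.

    Four gate blocks (input, forget, cell candidate, output), each with
    input weights, recurrent weights and one bias vector.
    """
    return 4 * (hidden_units * (input_dim + hidden_units) + hidden_units)

def has_enough_data(n_samples, n_parameters, factor=10):
    """Rule of thumb from the text: n > factor * degrees of freedom."""
    return n_samples > factor * n_parameters

# Hypothetical setup: 3 sensor channels feeding 50 LSTM units.
params = lstm_parameter_count(input_dim=3, hidden_units=50)
print(params, has_enough_data(200_000, params))
```

For this hypothetical layer the rule asks for more than 108,000 samples, which shows how quickly the data requirement grows with model size.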
Q5. Can data ingestion be automated?
Yes. Data ingestion is all about assigning data items to some real physical object or event, even if you have no idea what this mysterious thing may look like. This is an active area of research at IBM Watson. The most promising methods are deep auto-encoders, which can compress high-dimensional input data into a common lower-dimensional representation. Using this compressed representation, a lot of interesting things can be done to determine which data items somehow belong together.
Q6. How do you ensure data quality?
This is again a vote for domain knowledge. I have someone with domain skills assess each data source manually. In addition, I gather statistics on the accepted data sets so that any significant change raises an alert, which again has to be validated by a domain expert.
Q7. When is data mining a suitable technique to use and when not?
That depends on what you mean by data mining. To my students I teach that data science is data mining at scale plus the use of unstructured data. So if data mining is an unscaled version of structured data analysis, I’d say that most of my projects start as a data mining project, especially focusing on exploratory data analysis. You first have to get an idea of what your data looks like before you can actually answer any questions.
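One simple way to sketch the "statistics plus alert" idea is to compare the mean of each new batch with the mean of the data accepted so far. The threshold of three standard errors, the function name and the temperature values are assumptions for illustration; the interview does not say which statistics are actually tracked.

```python
import math
import statistics

def drift_alert(baseline, new_batch, z_threshold=3.0):
    """Flag a batch whose mean lies far from the mean of the accepted data.

    Returns (alert, z), where z is the distance between the two means
    measured in standard errors of the new batch mean.
    """
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)
    standard_error = base_std / math.sqrt(len(new_batch))
    z = (statistics.fmean(new_batch) - base_mean) / standard_error
    return abs(z) > z_threshold, z

# Hypothetical accepted temperature readings and two incoming batches.
accepted = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 19.9]
normal_batch = [20.0, 20.1, 19.9, 20.2, 19.8]
shifted_batch = [21.0, 21.2, 20.9, 21.1, 21.3]

normal_alert, normal_z = drift_alert(accepted, normal_batch)
shifted_alert, shifted_z = drift_alert(accepted, shifted_batch)
print(normal_alert, shifted_alert)
```

The alert only says that something changed. Whether the change is a sensor fault, a recalibration or a real physical effect is the part that still needs the domain expert.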
Q8. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
I’m using the classical statistical performance measures to assess the performance of a model. This is only about the mathematical properties of a model. Then I check with the domain experts on the significance for their problems. Often a statistically significant result is not relevant for the business. E.g. telling you that a bearing will break with 95% probability within the next 6 months might not really help the PMQ (Predictive Maintenance and Quality) guys. So the former can be described as “correct” or “good”, whereas the latter is maybe “relevant”.
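For a binary classifier, the classical measures referred to above typically include accuracy, precision, recall and the F1 score. The sketch below computes them from hypothetical labels (1 = bearing failed, 0 = bearing healthy). It only covers the "correct" or "good" side; whether the result is "relevant" cannot be computed and has to come from the domain experts.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```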
Q9. What were your most successful data projects? And why?
My latest project is also the most successful, I think. I’m currently writing up a blog post on that. I’ve implemented a deep neural network auto-encoder for unsupervised anomaly detection, capable of detecting anomalies in bearing vibration sensor data generated by a chaos-theory-inspired physical model. This model can detect anomalies with a 100-fold difference in reconstruction error between the healthy and the broken state. I’ve successfully integrated a prototype into the IBM Watson IoT Platform. Currently it still needs a dedicated Apache Spark cluster attached to it, but we are in the process of releasing this as a built-in service in the IBM Watson IoT Platform Edge and Cloud analytics services, so everyone can apply this algorithm to their IoT sensor data stream.
Q10. What are the typical mistakes made when analysing data for a large scale data project? Can they be avoided in practice?
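The principle of anomaly detection by reconstruction error can be sketched in a few lines. This is not the deep auto-encoder described above: a linear PCA projection stands in for the encoder and decoder, the data is synthetic, and all names and numbers are assumptions for illustration. The idea carries over, though: a model fitted only on healthy data reconstructs healthy signals well and broken ones poorly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "healthy" data: 8 sensor channels driven by 2 hidden factors
# plus a little measurement noise.
mixing = rng.normal(size=(2, 8))

def make_healthy(n):
    return rng.normal(size=(n, 2)) @ mixing + 0.05 * rng.normal(size=(n, 8))

healthy_train = make_healthy(400)
healthy_test = make_healthy(100)
# Synthetic "broken" data: healthy structure plus strong unstructured noise.
broken_test = make_healthy(100) + rng.normal(scale=1.0, size=(100, 8))

# Fit the stand-in "auto-encoder" on healthy data only: keep the 2 leading
# principal directions as a bottleneck.
mean = healthy_train.mean(axis=0)
_, _, vt = np.linalg.svd(healthy_train - mean, full_matrices=False)
components = vt[:2]

def encode(x):
    return (x - mean) @ components.T   # 8 channels -> 2 latent values

def decode(z):
    return z @ components + mean       # 2 latent values -> 8 channels

def reconstruction_error(x):
    return np.mean((x - decode(encode(x))) ** 2, axis=1)

healthy_err = reconstruction_error(healthy_test).mean()
broken_err = reconstruction_error(broken_test).mean()
ratio = broken_err / healthy_err
print(f"healthy {healthy_err:.4f}, broken {broken_err:.4f}, ratio {ratio:.0f}")
```

The size of the ratio here depends entirely on the synthetic noise levels chosen and says nothing about the figures reported for the real model.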
Go simple. Start with a baby model you can grow later. Use a programming language and framework where you are fast. Don’t waste your time with complex algorithms and keep in mind that every minute spent in data cleansing and feature engineering may save you one hour of work downstream.
Q11. What are the ethical issues that a data scientist must always consider?
The same as a non-data scientist. Don’t harm anybody, and keep in mind that every decision you take in life will affect the destiny of others. In Buddhism there is a saying that every decision you make should increase the overall happiness of all beings, and avoiding decisions that lead to the suffering of any living thing on this planet is one of the main goals in life.
Taught by: Romeo Kienzler, Chief Data Scientist.
Who is this class for: This course is designed for developers who want to improve their data analysis skills or data analysts who want to become experts in finding interesting patterns in IoT sensor data.