Q&A with Data Scientists: Mohammed Guller
Mohammed Guller is the author of the book “Big Data Analytics with Spark,” which is used as a textbook or reference book at many universities around the world. As a Big Data and Spark expert, he is frequently invited to speak or conduct Spark workshops at universities and Big Data conferences.
Mohammed works as the principal architect at Glassbeam, where he leads the development of advanced and predictive analytics products. He is passionate about building new products and machine learning. He has successfully led the development of several innovative products from concept to release.
Mohammed founded and ran TrustRecs.com before joining Glassbeam. Before that, he was at IBM for five years. Prior to IBM, he worked at a number of start-ups, leading new product development.
Mohammed has a Master of Business Administration (MBA) from the University of California, Berkeley, and a Master of Computer Applications from RCC, Gujarat University, India.
Q1. Is domain knowledge necessary for a data scientist?
Yes, I believe that domain knowledge is necessary for a data scientist. A typical data science project consists of multiple steps and domain knowledge can make a big difference in each step. One of the important steps in a data science project is exploratory analysis. In the absence of domain knowledge, it will be difficult for a data scientist to make sense of the data. In addition, without domain knowledge, it would be difficult to detect and fix data quality problems. Data is rarely clean. Domain knowledge is also important when a data scientist uses data for machine learning and creates predictive models.
Q2. What should every data scientist know about machine learning?
Machine learning is an effective tool for solving a variety of problems, but it is not a panacea or silver bullet for every problem. The success of a machine learning project depends on data. You could have the most sophisticated algorithm, but if the data lacks certain attributes, you cannot do much with it. By definition, machine learning is the science of training software to learn from data and find patterns or relationships in data. That means the data needs to contain some patterns or relationships. If they are missing, you cannot use machine learning. In addition, you need to have enough data for a machine learning algorithm to be able to figure out the patterns and relationships between different attributes. In other words, the more data you have, the better the results you will get from machine learning.
Q3. What are the most effective machine learning algorithms?
It depends. I don’t think there is a set of algorithms that will be effective for every machine learning problem. It depends on the problem and on the data. Certain algorithms are better suited for certain types of problems. Similarly, data characteristics can impact the effectiveness of an algorithm. Having said that, deep learning algorithms have turned out to be among the most effective machine learning algorithms. People are using them to solve a variety of problems.
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
We discussed the importance of domain knowledge earlier. To perform accurate feature engineering, you need to have good domain knowledge. Besides that, feature engineering involves a little bit of trial-and-error.
Q5. Can data ingestion be automated?
Yes, it can be automated. Data is essentially generated by a system or an application. You can build a batch or streaming data pipeline for automating data ingestion. Historically, batch mode has been used for data ingestion, but newer applications are implementing streaming data pipelines.
Q6. How do you ensure data quality?
It is a tough problem. Data quality issues generally occur upstream in the data pipeline. Sometimes the data sources are within the same organization and sometimes data comes from a third-party application. It is easier to fix data quality issues if the source system is within the same organization. Even then, the source may be a legacy application that nobody wants to touch.
So you have to assume that data will not be clean and address the data quality issues in your application that processes data. Data scientists use various techniques to address these issues. Again, domain knowledge helps.
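As an illustration of handling quality issues inside the processing application, here is a minimal sketch in Python. The record layout (`device_id`, `timestamp`, `temperature_c`) and the acceptable temperature range are hypothetical; in practice the rules would come from domain knowledge about the data source.

```python
from typing import Any

# Hypothetical validation rules for an illustrative sensor-reading record.
REQUIRED_FIELDS = ("device_id", "timestamp", "temperature_c")
TEMPERATURE_RANGE_C = (-50.0, 150.0)  # assumed plausible range, set from domain knowledge


def find_quality_issues(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    issues = []
    # Check that every required field is present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    # Check that the temperature is numeric and within the assumed range.
    temp = record.get("temperature_c")
    if isinstance(temp, (int, float)):
        low, high = TEMPERATURE_RANGE_C
        if not low <= temp <= high:
            issues.append(f"temperature_c out of range: {temp}")
    elif temp not in (None, ""):
        issues.append(f"temperature_c is not numeric: {temp!r}")
    return issues


def split_clean_and_rejected(records):
    """Separate records that pass all checks from those that do not."""
    clean, rejected = [], []
    for record in records:
        issues = find_quality_issues(record)
        if issues:
            rejected.append((record, issues))
        else:
            clean.append(record)
    return clean, rejected
```

Keeping the rejected records together with the reasons, rather than silently dropping them, makes it easier to report recurring problems back to the owner of the source system.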
Q7. When is data mining a suitable technique to use, and when is it not?
Data mining is a broad term that includes machine learning, statistical analysis, and business intelligence or plain old data analysis. It is an effective tool for getting insights from data. It is a suitable technique for making data-driven decisions and building smarter systems or applications. It is also a suitable technique for finding and preventing security threats.
It is not effective in cases where the data has too much noise or a lot of randomness. For example, it has not yet been found very effective for making trading decisions in the stock, bond, or forex markets.
Q8. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
This is where domain knowledge helps. In the absence of domain knowledge, it is difficult to verify whether the insight obtained from data analytics is correct. A data scientist should be able to explain the insights obtained from data analytics. If you cannot explain it, chances are that it may be just a coincidence. There is an old saying in machine learning, “if you torture data sufficiently, it will confess to almost anything.”
Another way to evaluate your results is to compare them with the results obtained using a different technique. For example, you can do backtesting on historical data. Alternatively, compare your results with the results obtained using the incumbent technique. It is good to have a baseline against which you can benchmark results obtained using a new technique.
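The idea of benchmarking against a baseline on historical data can be sketched as follows. This is only an illustration: the baseline chosen here (predict that the next value equals the previous one) and the error metric (mean absolute error) are assumptions, and the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_baseline_forecast(history):
    """Baseline: forecast each value as the value observed just before it."""
    return history[:-1]


def backtest_against_baseline(history, candidate_predictions):
    """Compare a candidate technique with the naive baseline on historical data.

    candidate_predictions[i] is the candidate's forecast for history[i + 1].
    Returns (candidate_is_better, baseline_error, candidate_error).
    """
    actual = history[1:]
    baseline_error = mean_absolute_error(actual, naive_baseline_forecast(history))
    candidate_error = mean_absolute_error(actual, candidate_predictions)
    return candidate_error < baseline_error, baseline_error, candidate_error
```

If the new technique cannot beat even a trivial baseline on past data, the insight it produces deserves extra scrutiny before anyone acts on it.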
Q9. What are the typical mistakes made when analyzing data for a large-scale data project? Can they be avoided in practice?
It is a broad question covering a broad topic. Let me answer this just in the context of machine learning. One of the common mistakes is not formulating the problem correctly. Even before you look at the data, have a good understanding of the question that you want to answer by analyzing data. Another common mistake is not checking data quality. Garbage in, garbage out. Similarly, not understanding the statistical properties of a dataset before applying any algorithm may lead to misleading results. Overfitting or under-fitting a machine learning model is another common mistake in machine learning projects. An overfit model gives good results on the training dataset but fails when applied to a new dataset, while an under-fit model gives poor results even on the training dataset. In either case, the model will not perform well on new data.
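A rough check for overfitting and under-fitting compares a model's score on the training data with its score on held-out data. The sketch below assumes a common rule of thumb: a low training score suggests under-fitting, and a large drop from training to held-out data suggests overfitting. The function name and both thresholds are hypothetical and would need tuning for a real project.

```python
def diagnose_fit(train_accuracy, validation_accuracy,
                 min_accuracy=0.8, max_gap=0.1):
    """Classify a model as 'underfitting', 'overfitting', or 'ok'.

    min_accuracy and max_gap are illustrative thresholds, not standard values.
    """
    # A model that cannot even fit its own training data is under-fit.
    if train_accuracy < min_accuracy:
        return "underfitting"
    # A large gap between training and held-out accuracy points to overfitting.
    if train_accuracy - validation_accuracy > max_gap:
        return "overfitting"
    return "ok"
```

For example, a model with 99% training accuracy and 70% validation accuracy would be flagged as overfitting under these assumed thresholds.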
Q10. What are the ethical issues that a data scientist must always consider?
Privacy-related issues are some of the big ethical issues that a data scientist has to deal with. Data scientists sometimes get access to personal or private information. Generally, data is masked before it is used for data mining, but sometimes private data gets through.
Another ethical issue that data scientists must consider is how the results of a data science project will be used. Will it be used to provide a better product or service to customers? Will it be used for a nefarious purpose?