Q&A with Data Scientists: Jeff Saltz

Jeff Saltz is currently an Associate Professor at Syracuse University, in the School of Information Studies. His research and teaching focus on helping organizations leverage information technology and data for competitive advantage. Specifically, Jeff’s current research focuses on the socio-technical aspects of data science projects, such as how to coordinate and manage data science teams. In order to stay connected to the “real world”, Jeff consults with clients ranging from professional football teams to Fortune 500 organizations and has recently authored an “introduction to data science” book.
 
Q1. Is domain knowledge necessary for a data scientist?
I think domain knowledge is very important to the data science team. So the data scientist either needs to work very closely with a subject matter expert or personally have significant domain knowledge. Without this knowledge, it is difficult for a data scientist to generate actionable insight.
 
Q2. What should every data scientist know about machine learning? 
Machine learning is becoming increasingly easy to use – and hence, I do not think all data scientists need to understand how to implement new machine learning algorithms. However, I do think all data scientists should understand the strengths and weaknesses of various techniques that are available and how the parameters impact model behavior.
 
Q3. How do you ensure data quality?
Data quality is a subset of the larger challenge of ensuring that the results of the analysis are accurate or described in an accurate way. This covers the quality of the data, what one did to improve the data quality (e.g., removing records with missing data) and the algorithms used (e.g., whether the analytics were appropriate). In addition, it includes ensuring an accurate explanation of the analytics to the client of the analytics. As you can see, I think of data quality as being an integrated aspect of an end-to-end process (i.e., not a check done before one releases the results).
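
Editorial illustration (not from the interview): a minimal Python sketch of the "remove records with missing data" step, written so that what was removed is recorded and can be explained to the client afterwards. The function name `drop_incomplete`, the field names, and the sample records are all hypothetical.

```python
from collections import Counter

def drop_incomplete(records, required_fields):
    """Return (kept_records, report); the report documents what was removed."""
    kept = []
    missing_counts = Counter()
    for rec in records:
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in required_fields if rec.get(f) in (None, "")]
        if missing:
            missing_counts.update(missing)
        else:
            kept.append(rec)
    report = {
        "total": len(records),
        "kept": len(kept),
        "dropped": len(records) - len(kept),
        "missing_by_field": dict(missing_counts),
    }
    return kept, report

# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "age": 34, "zip": "13244"},
    {"id": 2, "age": None, "zip": "13210"},
    {"id": 3, "age": 51, "zip": ""},
    {"id": 4, "age": 29, "zip": "13202"},
]
kept, report = drop_incomplete(records, ["age", "zip"])
print(report)
```

Returning the report alongside the cleaned data is one way to keep the cleaning step visible throughout the end-to-end process rather than hidden inside it.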
  
Q4. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
With respect to being relevant, this should be addressed by our first topic of discussion: the need for domain knowledge. It is the domain expert (either the data scientist or a different person) who is best positioned to determine the relevance of the results.
 
However, evaluating if the analysis is “good” or “correct” is much more difficult, and relates to our previous data quality discussion. It is one thing to try to do “good” analytics, but how does one evaluate if the analytics are “good” or “correct”? I think this is an area ripe for future research. Today, there are various methods that I (and most others) use. While the actual techniques we use vary based on the data and analytics used, ensuring accurate results ranges from testing new algorithms with known data sets to point sampling results to ensure reasonable outcomes.
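
Editorial illustration (not from the interview): a minimal Python sketch of the two checks mentioned above, under stated assumptions. The "new algorithm" here is a hypothetical incremental mean called `new_mean`; it is first tested on a small data set whose answer is known in advance, then a random sample of its results on synthetic data is spot-checked against the standard library's `statistics.fmean` as a trusted reference.

```python
import random
import statistics

def new_mean(values):
    """Hypothetical 'new' algorithm under test: an incremental mean."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n
    return mean

# 1) Known data set: the expected answer (5.0) is known in advance.
known = [2.0, 4.0, 6.0, 8.0]
assert abs(new_mean(known) - 5.0) < 1e-9

# 2) Point sampling: run the algorithm on many synthetic batches, then
#    spot-check a random sample of the results against a trusted reference.
rng = random.Random(0)  # fixed seed so the check is repeatable
batches = [[rng.uniform(0, 100) for _ in range(50)] for _ in range(200)]
results = [new_mean(b) for b in batches]

sample_idx = rng.sample(range(len(batches)), k=10)
mismatches = [
    i for i in sample_idx
    if abs(results[i] - statistics.fmean(batches[i])) > 1e-9
]
print("spot-check mismatches:", mismatches)
```

In practice a trusted reference is often unavailable, in which case the sampled results would be reviewed by hand or checked against plausible ranges instead.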
 
Q5. What were your most successful data projects? And why?
My most successful projects were in industry, and as such, I can't go into great detail. I will say that they had a few key attributes in common. First, there was buy-in from a senior sponsor. Second, the team had great insight into the business problem being addressed. Finally, acquiring the data was not a huge project in itself (in some cases, a separate pre-project was needed to collect the relevant data).
 
Q6. What are the ethical issues that a data scientist must always consider?
We, as data scientists, need to be thinking about ethics much more than I think is typically done. There are many situations where ethics is important and should be factored into the analysis to be done. One simple example is the fact that it is sometimes possible to take multiple anonymized datasets and, by merging the data, identify individuals in the data. While this is technically possible, one should consider the ethics of this activity. Another example is which attributes are appropriate for which types of models. For example, using attributes such as zip codes might not be ethical (or even legal in some situations).
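
Editorial illustration (not from the interview): a minimal Python sketch of how one might gauge re-identification risk before releasing or merging an "anonymized" dataset. It counts how many records share each combination of quasi-identifiers (attributes such as zip code or birth year that could be matched against another dataset); a record whose combination is unique is the kind that merging could single out. The function name `uniqueness_risk`, the choice of quasi-identifiers, and the records are all hypothetical, and the sketch assumes a non-empty list of records.

```python
from collections import Counter

def uniqueness_risk(records, quasi_identifiers):
    """Return (smallest_group_size, records_unique_on_quasi_identifiers)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    # Records whose quasi-identifier combination appears exactly once.
    unique = [r for r, k in zip(records, keys) if counts[k] == 1]
    return min(counts.values()), unique

# Hypothetical, fictional records with names already removed.
records = [
    {"zip": "13244", "birth_year": 1980, "sex": "F", "diagnosis": "A"},
    {"zip": "13244", "birth_year": 1980, "sex": "F", "diagnosis": "B"},
    {"zip": "13210", "birth_year": 1975, "sex": "M", "diagnosis": "C"},
]
smallest_group, exposed = uniqueness_risk(records, ["zip", "birth_year", "sex"])
print("smallest group size:", smallest_group)
print("records unique on quasi-identifiers:", len(exposed))
```

A smallest group size of 1 means at least one person could be singled out by anyone holding a second dataset with the same attributes, even though no names appear in the data.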
