Elena Simperl is a Professor of Computer Science at the University of Southampton in the UK. She leads Southampton’s Data Science MSc programme and is a founding member of the European Data Science Academy. Her research is at the intersection between knowledge technology and crowd computing. It helps system designers understand how to build online socio-technical systems that combine artificial intelligence with human and social capabilities.
Q1. Is domain knowledge necessary for a data scientist?
Domain knowledge is helpful. If a data scientist is working to gather insights for a company in a specific domain, then they should aim to get a firm grasp of the area. On the other hand, if they work on projects in different domains, then being an expert in all of them is not practical. After all, the main job of the data scientist is to work with the data.
In those cases, it might make sense to think about data science teams, bringing together domain experts with data engineers and analytics experts. The data scientist needs to be aware of the fact that analysis results need to be put in context and interpreted to be truly useful. Sometimes they can frame the discussion themselves, other times they might need the help of someone with a deeper understanding of the domain.
Q2. What should every data scientist know about machine learning?
The most important thing every data scientist should know about machine learning is how to evaluate the effectiveness of an algorithm applied to a particular set of data. Further, they should understand what that evaluation means: whether the insight the algorithm provides is useful for its intended purpose.
Secondly, the insights you can get are only as good as the data you have. If you have poor data, then no matter what you do, the value you can gain from it will be limited. “Garbage in, garbage out”, as the expression goes.
A data scientist should also have a broad understanding of different machine learning techniques, and be able to choose the most appropriate one for a particular circumstance. Even if they have a favoured algorithm they understand really well, they should still consider it in terms of its utility for the problem at hand, and be prepared to choose a different algorithm if appropriate. Each algorithm will work best in particular scenarios or will bias the results in a particular way. A data scientist needs to be aware of these details and work around them.
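To make this concrete, here is a minimal sketch of the kind of evaluation described above, assuming scikit-learn is available: two candidate algorithms are compared on the same data via cross-validation, rather than trusting a single favoured model. The dataset and scoring metric are illustrative choices, not prescriptions.

```python
# A sketch of comparing candidate algorithms by cross-validated score,
# rather than committing to one favoured model up front.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

for name, model in candidates.items():
    # Five-fold cross-validation; F1 chosen as an example metric that
    # should itself be justified against the purpose of the analysis.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

The point is not the numbers themselves, but that the same evaluation protocol is applied to every candidate, so the trade-offs between them can be judged against the purpose the insight is intended for.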
Q3. Can data ingestion be automated?
There can certainly be a level of automation for data ingestion in some domains. Once a set of parameters for the different datasets has been established, custom code or existing libraries can largely automate the process.
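As an illustration of what "established parameters" might look like, here is a minimal sketch, assuming delimited text files and a hypothetical agreed specification (delimiter and expected columns); once the specification is fixed, ingestion reduces to reusable code that fails fast when the data drifts from it.

```python
import csv
import io

# Hypothetical ingestion parameters agreed up front for a dataset.
SPEC = {"delimiter": ";", "columns": ["id", "name", "value"]}

def ingest(raw_text, spec):
    """Parse delimited text into dicts, rejecting unexpected columns."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=spec["delimiter"])
    if reader.fieldnames != spec["columns"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)

rows = ingest("id;name;value\n1;widget;9.50\n", SPEC)
```

The check on the column names is the important part: automation is safe only so long as the assumptions baked into the parameters still hold for each new release of the data.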
On the other hand, in areas such as linked data and open data, the community encourages data to be released openly, in the belief that this improves its quality. Data is often published on an ad hoc basis, which means it is not possible to rely on it being released in the same way each time. In addition, documents are unfortunately often released only in formats such as PDF, from which it is difficult to extract data accurately. That is not to say the process cannot be automated to some extent, but it does require additional care to verify accuracy.
Q4. How do you ensure data quality?
It is not possible to “ensure” data quality, because you can never say for certain that there isn’t something wrong with it somewhere. There is also research suggesting that compiled data inherently carry the (unintentional) biases of the people who compile them. You can minimise quality problems by recording full provenance for the source of the data, and by erring on the side of caution where some part of it is unclassified or possibly erroneous.
One of the things we are researching at the moment is how best to leverage the wisdom of the crowd to ensure data quality, an approach known as crowdsourcing. Tools such as Crowdflower make it easy to organise a crowdsourcing project, and we have had some success in image understanding, social media analysis, and data integration on the Web. However, the best ways of optimising cost, accuracy, or time remain to be determined, and they vary with the particular problem and the motivation of the crowd one works with.
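One simple way crowd judgements are commonly aggregated is plurality voting over redundant labels. The sketch below uses hypothetical data and is only one of many aggregation strategies; it is not a description of any particular project mentioned above.

```python
from collections import Counter

# Hypothetical crowd judgements: each item labelled by several workers.
judgements = {
    "img_01": ["cat", "cat", "dog"],
    "img_02": ["dog", "dog", "dog"],
}

def majority_vote(labels):
    """Aggregate redundant crowd labels by simple plurality."""
    (label, _count), = Counter(labels).most_common(1)
    return label

consensus = {item: majority_vote(labels) for item, labels in judgements.items()}
```

In practice the number of workers per item is itself a design choice, trading cost and time against accuracy, which is exactly the optimisation question that remains open.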
Q5. What techniques can you use to leverage interesting metadata?
Metadata enriches a dataset by providing additional insight into the data or by enabling it to be linked to other, similar datasets. One way of doing this is to use RDF, or other Semantic Web technologies, which represent the metadata in a form other computers can understand, provided they have access to the vocabulary being used. This usually requires some manual mapping to ensure that definitions are consistent across domains. The idea of a “Web of Data” is an example of this at a wide scale, where more useful semantic queries become possible.
Q6. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
This question links back nicely to a couple of the earlier ones. Good domain knowledge comes into play in answering the relevance question. Ideally a data scientist will know the domain well; if not, they need to understand what the domain expert considers relevant to it.
The correctness or value of the data then comes down to understanding how to evaluate machine learning algorithms in general, and applying domain knowledge to decide whether the trade-offs are appropriate for the domain.
Q7. What are the ethical issues that a data scientist must always consider?
As a general principle, anyone working in data science needs to remember that the insights they develop will affect real people, and any ethical decisions should be made in that context. Big Web companies in particular can exert significant influence over the way we think and act. People can be assessed strictly on defined metrics, without considering the other value they might bring. Other projects could lead to new insights that make an entire class of jobs obsolete, leaving people without work or career prospects.
The origin and destination of the data is something else which should be considered. Suppose the data was stolen from a company’s database and dumped online. There might be some really interesting insights to be gathered from that, but is it ethical to pore over someone’s private data in this way? For the destination, suppose you develop a set of features which can predict someone’s sexuality from publicly available social networking data. What happens if that gets used in a country where homosexuality is illegal?
That is not to say by any means that we should not make use of data! There will always be progress, and mostly it will be good. But always rigorously consider the ethics and consequences of the advances generally, and for the participants in a specific project.