Q&A with Data Scientists: Jochen Leidner
Dr Jochen Leidner (@jochenleidner) is Director of Research with Thomson Reuters, where he runs the London R&D site, which he set up in 2013.
He has held past Scientist roles with the company in the UK, the US and Switzerland.
He holds a Master’s degree in engineering from the University of Cambridge and a PhD in informatics from the University of Edinburgh. He occasionally teaches language technology and big data at the University of Zurich.
He was awarded an ACM SIGIR Doctoral Consortium Award for his PhD work and a Royal Society of Edinburgh Enterprise Fellowship for his research in mobile question answering.
In 2015 and again in 2016, he was named co-recipient of the Thomson Reuters Inventor of the Year award for the best patent application. His research interests include information extraction, question answering, aspects of search, information access, applied machine learning, big data and software engineering.
Q1. Is domain knowledge necessary for a data scientist?
Domain knowledge is necessary for a project, but it is not necessary that the data scientist has it from the beginning; rather, the data science team needs to have access to subject matter experts (SMEs) when they need to pick their brains.
Generally, I find it unrealistic how many articles on the Web portray the ‘data scientist’ as a know-it-all super-hero.
We call it the ‘unicorn problem’: no individual will know everything about everything; what you want is to put together a collaborative team ranging from machine learning experts, experimenters (e.g. from experimental psychology), software engineers/architects, big data engineers, domain experts, language experts and statisticians to user experience design gurus.
As each of these aspects develops, it will become so sophisticated that it truly takes one person in each sub-field with enough expertise to extend the state of the art. If people want to hire a single ‘unicorn’ data scientist for everything, that likely indicates they do not understand well the complexity involved. Over time, such an interdisciplinary team acquires domain knowledge gradually by engaging with the domain experts regularly to co-create new methods and systems, so everybody gets to acquire some of everyone else’s knowledge, but only to the extent required by the joint work.
Q2. What should every data scientist know about machine learning?
First, I don’t like the term data scientist, because it is redundant. How could you possibly be a scientist without working with data? Have you ever heard of a chef calling themselves a ‘food chef’?
Now regarding machine learning, there is quite a lot to learn. The most important thing is experimental methodology around annotation, inter-rater agreement and evaluation metrics. I’m a big proponent of ‘evaluation first’; in other words, don’t build any code for a system before thinking about (and setting up) the evaluation.
Then there’s the standard toolbox for machine learning practitioners: SVMs, CRFs, MaxEnt, Random Forests, Naive Bayes and neural nets. Multi-layer nets (a.k.a. deep learning) are relevant, but only if you have large training data sets available (or if you can generate them artificially). We add any new method to our toolbox and make sure we use whichever method makes the most sense in a given situation. Besides that, if you automate your pipeline, you can exchange classifiers easily for one another and determine empirically which one works best on a particular data set.
It would be nice if we had a theory that could predict which classifier to use to get the best results, but our machine learning meta-knowledge is not that far yet.
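To make the ‘automate your pipeline and swap classifiers’ point concrete, here is a minimal sketch assuming scikit-learn and a stand-in synthetic dataset; the classifier choices, the metric and the data are illustrative only, not part of the interview.

```python
# Minimal sketch: an automated pipeline in which classifiers can be swapped
# and compared empirically on the same splits and metric (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in data; in practice this would be your annotated gold data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC()),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "NaiveBayes": GaussianNB(),
}

# Because the pipeline is automated, exchanging one classifier for another
# is a one-line change, and the comparison stays consistent.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name:>12}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```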
Q3. What are the most effective machine learning algorithms?
What’s the most effective data? Well, it depends on what you are trying to do. Most people use SVMs very successfully for classification tasks. CRFs excel at sequence tagging (unless you have good knowledge about how the sequences are related, in which case you might want to use something else). The one important thing that should be said is that the learning method is less important than the features used to classify (assuming supervised learning here).
Q4. What is your experience with data blending?
If you take ‘data blending’ to refer to the process of combining data from multiple sources into a working dataset on a more ‘ad hoc’, on-demand basis, rather than by copying the data into the same store or joining it together in a database, then it is something that is desirable, but it can be challenging. In a large organization you will always find corners that are well organized (blendable) and others where you need to talk to people first to get access to particular data sets.
I have not played with particular tools, but I am aware of some vendors like Pentaho, Tableau, or Datawatch.
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Start with the obvious low-hanging fruit (for text: the words and the meta-data of the text), and then think about what could be good proxies for your target class. What is highly correlated with the outcome you are trying to classify?
This is perhaps the most creative part of building machine learning models, and intuition and experimental validation need to go hand in hand, as intuition can be a poor guide.
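As a concrete illustration of that low-hanging fruit for text, here is a hedged sketch that combines word features with simple metadata-derived proxy features; the documents, the metadata fields and the proxies are made up for the example, and scikit-learn/SciPy are assumed.

```python
# Illustrative feature engineering for text: the words themselves plus
# simple metadata proxies that might correlate with the target class.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Fed raises rates amid inflation fears",
        "Local team wins championship after overtime"]
metadata = [{"source": "newswire", "hour": 14},
            {"source": "blog", "hour": 22}]

# Word features: the obvious starting point for text.
word_features = TfidfVectorizer(lowercase=True).fit_transform(docs)

# Metadata proxies (hypothetical): is the source a newswire, and at what hour
# was the item published?
proxy_features = np.array([[1 if m["source"] == "newswire" else 0, m["hour"]]
                           for m in metadata])

# Combined feature matrix fed to whatever classifier is used downstream.
X = sp.hstack([word_features, sp.csr_matrix(proxy_features)])
print(X.shape)
```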
Q6. Can data ingestion be automated?
It can, and it should. The more your research systems are automated, the easier it is to replicate past numbers if you have to go back to an experiment in half a year’s time. Also, the more automated your research pipeline, the quicker you can implement your work in production.
Q7. How do you ensure data quality?
There are a couple of things: first, make sure you know where the data comes from and what the records actually mean.
Is it a static snapshot that was already processed in some way, or does it come from the primary source? Plotting histograms and profiling the data in other ways is a good start for finding outliers and data gaps that should undergo imputation (filling of data gaps with reasonable fillers). Measuring is key, so doing everything consistently, from inter-annotator agreement on the gold data, through training, dev-test and test evaluations, to human SME grading of the output, pays back the effort.
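A minimal sketch of the profiling, imputation and agreement measurement mentioned above, assuming pandas and scikit-learn; the columns, values and labels are invented for illustration.

```python
# Basic data profiling, simple imputation, and inter-annotator agreement.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.DataFrame({"price": [10.0, 11.2, None, 9.8, 250.0],   # 250.0 looks like an outlier
                   "volume": [100, 120, 95, None, 110]})

print(df.describe())                  # quick profile: ranges, means, counts
print(df.isna().sum())                # where are the data gaps?
df_imputed = df.fillna(df.median())   # fill gaps with a reasonable filler

# Inter-annotator agreement on the gold data (two annotators, same items).
annotator_a = ["pos", "neg", "pos", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "pos", "neg"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```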
Q8. When is data mining a suitable technique to use, and when not?
Clustering, frequent itemset mining and other data mining methods are good for making recommendations, and also for discovering things in data serendipitously; when your task is very clearly defined and you already know the data well, that often suggests supervised learning might be the right approach.
Q9. What techniques can you use to leverage interesting metadata?
If the meta-data groups your data into sub-categories (e.g. news by topic), factorized models can be induced for each topic to deal with the task at hand, for example. The availability of timestamps permits time lines to be constructed. But the most exciting metadata types that I’ve worked with in the past are (1) usage data and (2) geospatial data.
They are both especially valuable in the mobile space.
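As a rough illustration of inducing one model per metadata sub-category, here is a hedged sketch; the topic names, records and the choice of logistic regression are assumptions for the example, not the interviewee's method.

```python
# One model per metadata sub-category (e.g. news topic); at prediction time
# the metadata routes each record to the matching model.
from collections import defaultdict
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical records carrying a "topic" metadata field.
records = [{"topic": t, "x": rng.normal(size=5), "label": int(rng.integers(0, 2))}
           for t in ["markets", "sports"] * 50]

by_topic = defaultdict(list)
for r in records:
    by_topic[r["topic"]].append(r)

models = {}
for topic, rows in by_topic.items():
    X = np.stack([r["x"] for r in rows])
    y = np.array([r["label"] for r in rows])
    models[topic] = LogisticRegression().fit(X, y)
```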
Q10. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
There is nothing quite as good as asking domain experts to vet samples of the output of a system. This is time-consuming and needs preparation (to make their input actionable), but the closer the expert is to the real end user of the system (e.g. the customer’s employees using it day to day), the better.
Q11. What were your most successful data projects? And why?
This is a bit like asking a parent to name their favourite child. Also, it depends on whether you mean ‘financially rewarding’, ‘intellectually satisfying’, or whatever your metric for success is.
Perhaps the most rewarding experience for a researcher is participation in a so-called shared task, where research groups benchmark different methods against each other. I found leading several teams in such shared tasks in open-domain question answering (e.g. TREC or CLEF) particularly exciting and rewarding. The reason is that when you meet the other participants, every one of them has tried to crack the same problems that you had to, so there are many things worth sharing.
The task of pinpointing a single answer (e.g. a number) in millions of documents has something intrinsically hopeless about it (a real needle in a haystack), so when your system finally gets many answers right, the emotional reward is equally high.
Q12. What are the typical mistakes made when analyzing data for a large-scale data project? Can they be avoided in practice?
There’s one mistake that people make that doesn’t get discussed much – that is, to assume something is in the data, and then later it turns out it’s actually not contained in it after all. Care needs to be taken with everything from big-picture issues like rights management of data (entitlements) or ethics (is it right to do this project?) to more microscopic issues like getting too excited too early about spurious correlations/patterns in the data. Perhaps the most common error, and there seems to be consensus on that in the community, is underestimating the time the pre-processing/data cleansing will take.
Q13. What are the ethical issues that a data scientist must always consider?
Is the project I am expected to run ethical? If it is okay overall, are there some murky corners, perhaps? Would you like to see your own data used in the application you are developing, or is anyone’s privacy compromised? Are you being transparent with the people whose data it is? If they are individual people, have they consented to it via opt-in? Are there any issues of discrimination, racism? How are errors handled; is there a recourse?
For the technology-excited person, who is full of ideas, it is tempting to jump right into the technicalities.
But it is important to do the morally right thing, and it’s great that the research community is now beginning to have this conversation; check out the Workshop on Ethics in NLP http://ethicsinnlp.org/ and the FATML workshop series http://www.fatml.org/ to learn more about this.
If you are interested in ethics and regulation, there will also be a panel on artificial intelligence, ethics and regulation that I will be speaking on as part of CeBIT 2017 (March 24), so join this important discourse if you can.
It’s important that we are actively building the ‘right’ tomorrow, which is not going to happen automatically, so let’s not take it for granted.
Thanks for the interview, Roberto.