On Machine Learning, NLP and Ludwig. Q&A with Piero Molino, Senior Research Scientist, Uber AI
Q1. Uber AI has developed a tool (1) that uses machine learning and natural language processing (NLP) techniques to help agents deliver better customer support. What was the main motivation for this project?
The motivation was twofold. We wanted to improve our customers' experience when they reach out for support by providing faster and more accurate responses. We decided to do that with a tool that helps customer support representatives by giving them suggestions that speed up and improve their decision making. This makes them faster and more accurate while also reducing costs, since the same number of representatives can solve more tickets.
Q2. The tool has been deployed leveraging Michelangelo (2), an internal ML-as-a-service platform. Why did Uber develop Michelangelo rather than use an existing ML data platform?
Before Michelangelo, various machine learning training and deployment processes were running in parallel within the company. The main goal of Michelangelo as a platform was to streamline those processes, reducing the friction in training and deploying models at scale. Our specific requirements in terms of infrastructure, interoperability with the rest of our data ecosystem, and capabilities like latency and uptime led us to develop it in-house.
Q3. COTA uses machine learning and natural language processing (NLP) techniques to help agents deliver better customer support. What specific technical challenges did you face and how did you solve them?
We had to face many challenges. The most distinctive ones are probably the structure intrinsic to the learning tasks and the ever-changing data distribution. As for the former, the model had to perform different predictions at the same time, and some of those have a hierarchical structure in their set of labels.
This led to two of the main innovations in our model. First, we treated the prediction of labels in a hierarchy as a sequence of decisions starting from the root, so that the model predicts paths from the root of the hierarchy to its leaves rather than predicting leaf labels directly.
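To make the path idea concrete, here is a minimal sketch in Python. It is an illustration only, not the COTA implementation: the label hierarchy, the names `PARENT`, `path_to_leaf` and `decode_path`, and the greedy decoding step are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical label hierarchy, stored as child -> parent.
# None marks a node that sits directly under the root.
PARENT: Dict[str, Optional[str]] = {
    "trip": None,
    "payment": None,
    "fare_issue": "trip",
    "lost_item": "trip",
    "double_charge": "payment",
    "refund_request": "payment",
}


def path_to_leaf(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Turn a leaf label into its root-to-leaf path (the training target)."""
    path: List[str] = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path


def children_of(node: Optional[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """List the direct children of a node (None means the root)."""
    return sorted(child for child, p in parent.items() if p == node)


def decode_path(
    score: Callable[[List[str], str], float],
    parent: Dict[str, Optional[str]],
) -> List[str]:
    """Greedy decoding (an assumption): pick the best child at every level.

    `score(path_so_far, candidate)` stands in for a trained model.
    """
    path: List[str] = []
    node: Optional[str] = None
    while True:
        candidates = children_of(node, parent)
        if not candidates:  # reached a leaf
            return path
        node = max(candidates, key=lambda c: score(path, c))
        path.append(node)
```

With this encoding, a ticket labeled `double_charge` is trained against the path `["payment", "double_charge"]`, and at prediction time only the children of the node chosen so far compete with each other.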
The other interesting innovation was the definition of dependencies between the different predictions, so that the model's output for one task informs the tasks that depend on it. Both of these innovations led to better prediction accuracy. As for the change in distribution, issues in customer support change over time, either because some root causes are identified and solved or because of seasonality. We addressed this in two ways. First, we performed an extensive study of the amount and age of historical data needed to train our models, which revealed that training on data that is too old actually decreases model performance. Second, we analyzed how much the distribution shift over time affected model performance, which led us to define an incremental retraining strategy that balances performance against the frequency of retraining.
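The data-age study can be pictured with a short sketch like the one below. It assumes each example carries a creation date and that some evaluation routine (here the hypothetical `evaluate` callback) trains on a window and returns a validation score; the actual study and retraining schedule used at Uber are not described in this interview.

```python
from datetime import date, timedelta
from typing import Callable, List, Sequence, Tuple

# (created_on, ticket_text, label); the field layout is an assumption.
Example = Tuple[date, str, str]


def training_window(
    examples: Sequence[Example], as_of: date, max_age_days: int
) -> List[Example]:
    """Keep only examples created within `max_age_days` before `as_of`."""
    cutoff = as_of - timedelta(days=max_age_days)
    return [ex for ex in examples if cutoff <= ex[0] < as_of]


def best_window(
    examples: Sequence[Example],
    as_of: date,
    candidate_ages: Sequence[int],
    evaluate: Callable[[List[Example]], float],
) -> int:
    """Return the candidate age (in days) whose window scores highest.

    `evaluate` is a placeholder for "train on this window, then measure
    accuracy on recent held-out tickets".
    """
    scored = [
        (evaluate(training_window(examples, as_of, age)), age)
        for age in candidate_ages
    ]
    _, best_age = max(scored)
    return best_age
```

Running `best_window` again at regular intervals is one simple way to check whether older data has started to hurt, which is the kind of signal a retraining schedule can be built on.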
Q4. For the implementation of your use case, you first decided to build an NLP model that analyzes text at the word-level to better understand the semantics of text data. What experience do you have in using NLP?
I personally have more than seven years of experience in applying cutting-edge NLP techniques to effective industrial applications.
What I learned from this specific project is that the basic assumption that more data is always better doesn't hold for real-world NLP applications, in particular the ones dealing with user-generated content, because of the shift in distribution mentioned earlier. The other conclusion I can draw is that deep learning models can provide a significant performance boost in real-world NLP applications too, particularly if the data is big enough (and in our case it was, with millions of tickets per week).
Q5. Why not use deep learning directly for that?
Real-world machine learning is a matter of compromise. It isn't always the case that the best model gets deployed to production: other factors like latency, infrastructure constraints, and the amount of effort and time needed to provide a specific solution always play a role. Even among the several deep learning architectures we tested, we ended up deploying the second most accurate one for latency reasons. But to get back to your question, my role in the project as a research scientist was exactly to experiment offline with deep learning models to figure out whether they could provide a big enough gain to justify the investment in switching to them. It was an exercise in figuring out what works empirically and adopting the most reasonable solution in a purely data-driven way.
Q6. In a second phase, you implemented several architectures based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), and different combinations of the two, including hierarchical and attention-based architectures.
Can you tell us what the main lessons are that you learned from this piece of work?
The main lesson is that, in our use case and dataset at least, the difference in performance between architectures is not marked, and consequently the more efficient CNN-based architectures were chosen for implementation.
This finding is in line with recent findings in the literature. Another important finding for us was the value of a well-structured experimental setting that allows for fair comparison among models, reproducibility, and analysis of the results. Without it, it would have been difficult to make the same amount of progress at the same speed.
Qx. Anything else you wish to add?
The experience gained in this project led us to develop a tool for experimenting with different deep learning architectures. It allows its users to train models for a wide variety of tasks (NLP, vision, sequence and time-series prediction, and much more) and use them to make predictions.
The tool, called Ludwig, has recently been open-sourced and allows anyone to build deep learning models without needing deep machine learning skills and without writing any code. We used it in our offline experiments within COTA and in countless other use cases within Uber.
We hope the tool will be useful to the wider machine learning community, and we made it easily extensible to allow community involvement in its development.
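As a rough illustration of what "without writing any code" means, Ludwig models are described declaratively by listing input and output features, each with a name and a type. The sketch below writes such a description as a Python dictionary; the column names `ticket_text` and `contact_type` and the helper `feature_names` are hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Declarative model description: which columns go in, which come out.
# The column names are hypothetical and do not come from the interview.
ticket_model_config: Dict[str, List[Dict[str, str]]] = {
    "input_features": [
        {"name": "ticket_text", "type": "text"},
    ],
    "output_features": [
        {"name": "contact_type", "type": "category"},
    ],
}


def feature_names(config: Dict[str, List[Dict[str, str]]], section: str) -> List[str]:
    """Small helper (not part of Ludwig) to list the feature names in a section."""
    return [feature["name"] for feature in config[section]]
```

A description like this is typically kept in a YAML file and handed to the tool together with a dataset, so trying a different model means editing the description rather than the training code.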
——————————
Piero Molino is a Senior Research Scientist at Uber AI with a focus on machine learning for language and dialogue. Piero completed a PhD on Question Answering at the University of Bari, Italy. He founded QuestionCube, a startup that built a framework for semantic search and QA. He worked for Yahoo Labs in Barcelona on learning to rank and for IBM Watson in New York on natural language processing with deep learning, and then joined Geometric Intelligence, where he worked on grounded language understanding. After Uber acquired Geometric Intelligence, he became one of the founding members of Uber AI Labs.
Resources
– Ludwig (GitHub): Ludwig is a toolbox that lets users train and test deep learning models without the need to write code.
– Introducing Ludwig, a Code-Free Deep Learning Toolbox, Piero Molino, Yaroslav Dudin, and Sai Sumanth Miryala, Uber Engineering – Article that describes the tool.