On building an NLP pipeline. Q&A with Alex Mikhalev
Q1. You participated in 2020 in a RedisConf hackathon where you built an NLP pipeline. What was the purpose of the project?
The medical profession has put a lot of effort into collaboration, starting from Latin as a common language to industry-wide thesauruses like UMLS. Yet it is full of scandals in which a publication in a prestigious medical journal is retracted after the World Health Organisation has already changed its policy advice based on the published article. I think a paper claiming that “eating a bat-like Pokémon sparked the spread of COVID-19” takes the prize. One would say that editors in those journals don’t do their job, and while it may seem true, I would say they had no chance. The number of articles published about COVID (SARS-CoV-2) exceeds 300 per day. We need better tools to navigate the flood of information.
When I am exploring topics in science or engineering, I look for diversity of opinion, not variations on the same cluster of words or thoughts. I want to avoid confirmation bias. I want to find articles relevant to the same concept, not necessarily the ones which have similar words. My focus is to build a natural language processing pipeline capable of handling a large number of documents and concepts by incorporating System 1 AI (fast, intuitive reasoning) and System 2 AI (high-level reasoning), and then present knowledge in a modern VR/AR visualisation.
Search or information exploration should be spatial, preferably in VR (memory palace, see Theatre of Giulio Camillo). A force-directed graph is a path towards it, where visuals are assisted by text. The relevant text pops up on the connection where people can explore the concepts, and then dig deeper into text. The purpose of the pipeline is that knowledge should be re-usable and shareable, hence “The Pattern” – the tool to help navigate information complexity in the modern world. It’s my fun out-of-office hours project with no connection to my day job.
Q2. Is the system in production by now? If yes, what are the lessons learned so far?
My production is a single demo server for now: https://thepattern.digital/. It holds a 100GB RedisGraph database and has a separate Redis Cluster for processing, with shards holding 120GB of data. There are many lessons learned. The first is that Redis is very memory efficient for data processing, even when using quite heavy machine learning libraries. The second is that maintaining one’s own infrastructure is time consuming.
Q3. How do you measure if your system is helping medical professionals navigate through medical literature?
I am in conversation with the Oxford University (UK) Praxis Forum, a collaborative forum for experienced researchers and practitioners, to see if we can use “The Pattern” for one of their workshops. The initial idea of the project was conceived as an “Engelbart’s demo”: augmenting human intelligence using modern AI and VR tools, and demonstrating new concepts to assist knowledge exploration, presentation, and re-use at the community level. Now I am checking whether the medical community will be open to using the tool.
Q4. What are the key technical challenges in incorporating modern machine learning modules into the NLP pipeline for Question Answering (the chatbot)?
In building chatbots there are many challenges. One is to build a dialogue system and another is to make bots understand the domain area. The second is what I focused on in “The Pattern” project. I wanted to be able to leverage medical information and the Metathesaurus, which is created and maintained by thousands of people.
Q5. You mentioned in one of your presentations that while creating a BERT-based QA service is quite straightforward, API response time wasn’t good. Why?
First, we need to understand how BERT QA models work: they provide answers by extracting them from a piece of text known as the context. For example, given a line of text, “This would need tight coordination among pharmaceutical companies, governments, regulatory agencies, and the World Health Organization (WHO), as well as novel and out-of-the-box approaches to cGMP production, release processes, regulatory science, and clinical trial design,” and a question, “What is the effectiveness of community contact reduction?,” the BERT model will produce an answer, “This would need tight coordination.”
One answer from one context is straightforward and takes about 1.4 seconds on modern hardware. Now, if we have 55 thousand articles with millions of potentially relevant contexts, we need to pre-process those articles and narrow down the candidate contexts before running the QA model. The common way is to use traditional search engine techniques and rank using the TF-IDF or BM25 algorithms. There is a great write-up on this: “Open-Domain Question Answering System.” What I wanted to do differently is that, because my domain is medical, I wanted to add experts’ knowledge into the rankings, so I used RedisGraph to rank contexts before question answering. I think graph-based techniques are a promising direction, with graph embeddings and graph-based machine learning becoming more well known.
Q6. Why did you choose Redis Labs, and what are the main benefits of using Redis Labs?
I think there is an ever-growing demand for making AI/ML products practically useful. If I look at the default ML data science pipeline, it has a very slim chance of making it into production, particularly if I take an engineering or architecture lens on it. A lot of the examples and frameworks in the data science community are built around batch processes, and this is where the challenge lies. Most businesses nowadays are event-driven, and you have to rethink or rewrite your solution completely if you want to take it from being batch-driven into the world of event-driven processes. I would like to help bridge that gap between pure data science and engineering thinking, and this is where Redis can fit. Redis Labs created real-time, memory-efficient components which can be used to build high-performance NLP pipelines, and I am using them quite successfully.
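As an illustration of the conventional ranking step mentioned above (not the RedisGraph-based ranking used in “The Pattern”), here is a minimal, self-contained sketch of BM25 scoring used to shortlist contexts before the QA model runs. The helper names `tokenize` and `bm25_rank`, the parameter defaults `k1=1.5` and `b=0.75`, and the three sample contexts are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(question, contexts, k1=1.5, b=0.75):
    """Return (score, index) pairs for contexts, best match first."""
    docs = [tokenize(c) for c in contexts]
    n = len(docs)
    if n == 0:
        return []
    # Average document length; fall back to 1.0 if every context is empty.
    avgdl = sum(len(d) for d in docs) / n or 1.0
    # Document frequency: in how many contexts each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    query_terms = set(tokenize(question))
    ranked = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        ranked.append((score, i))
    # Highest score first; ties keep the original context order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked


contexts = [
    "Community contact reduction lowers transmission.",
    "Clinical trial design needs regulatory science.",
    "Vaccines require cGMP production.",
]
question = "What is the effectiveness of community contact reduction?"
ranked = bm25_rank(question, contexts)
top_index = ranked[0][1]
print(contexts[top_index])
```

Only the top few contexts from such a ranking would then be passed to the QA model, so the roughly 1.4 seconds per answer is paid a handful of times per question rather than millions of times.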
Q7. What is your experience so far in using the Redis ecosystem?
Very positive. I really like their openness and responsiveness. I think one of their best assets is the open discussion between core team engineers and CTO/product owners on GitHub, and the responses are very inspiring.
Q8. Can you share with us some details of the implementation?
All code is open source and I expanded the project during the latest RedisConf Hackathon 2021: https://github.com/applied-knowledge-systems/the-pattern. I use RedisGears quite extensively to process data using very short snippets of code, Redis Cluster for data sharding and effective multiprocessing, RedisGraph for graph storage and query ranking, and RedisAI for inferencing. The project also has a 3D/VR interface which is written by Brian Wachanga and we are working on hand tracking for VR for knowledge graph interaction.
Q9. Anything you wish to add?
Thank you for giving me the opportunity to contribute and share my experience. I also have an article on Redis Labs’ website, “Building a Pipeline for Natural Language Processing using RedisGears,” and a talk at RedisConf 2021, “How to deploy ML into RedisAI.” The repository at https://github.com/applied-knowledge-systems/the-pattern is open to contributions from anyone; let’s keep coding fun!
I am a systems thinker with a deep understanding of technology and a methodological approach to innovation. Over the last 20 years, I have held multiple engineering, academic and leadership positions. I have a systematic approach to innovation, which allows me to innovate and inspire others to innovate: for example, in 2018, my team built a “blockchain-inspired” distributed system which resulted in the first technology patent in the history of Nationwide Building Society. I have deep knowledge of and interest in data privacy, synthetic data, distributed data, natural language processing/search engines, sensors, and wired and wireless networks. My unique combination of skills allows me to take novel ideas from inception into production.