On Using Vector Search. Q&A with Tom Dyar and Dirk van Hyfte

by Roberto Zicari · Published May 8, 2024 · Updated May 8, 2024

Q1: You recently announced the addition of vector search to the InterSystems IRIS data platform. What are the benefits of this addition to the InterSystems IRIS data platform?

Thomas Dyar: By integrating vector search into InterSystems IRIS, we can significantly enhance our data handling capabilities, providing a powerful tool for semantic searches. This means that users can find information based on meanings rather than just keywords, revolutionizing the way we interact with text. What’s even more remarkable is that vector search extends beyond text and can be applied to images and various data types. As long as you have an embedding model trained to convert your data into a dense vector, the power of vector search is at your disposal. The outcome? Technology designed for speed and more accurate information retrieval, unlocking deeper insights and empowering smarter decision-making across diverse industries.

Q2: What kind of next-generation AI applications do you plan to help with vector search?

Thomas Dyar: Vector search opens new possibilities for AI applications, especially in the realms of semantic search and Retrieval-Augmented Generation (RAG). We’re enabling applications that can understand user queries more deeply and fetch information that’s not just relevant but contextually accurate. InterSystems traditionally focuses on high-performance mission-critical settings, and for AI this is no different – in addition to supporting general RAG applications, our platform also gives developers granular access to the storage engine for custom applications that require it.

Q3: Specifically, how does vector search help enable efficient and accurate retrieval of relevant information from massive datasets when working with large language models?

Thomas Dyar: Vector search transforms the challenge of navigating massive datasets from daunting to manageable. By encoding text into a vector space, we’re able to compare and retrieve information based on conceptual similarity, not just spelling matches. This means significant potential for more relevant results in a fraction of the time, a critical advantage when working with large language models and extensive databases. We will soon be releasing a high-performance approximate nearest neighbor (ANN) index, which is designed to ensure scalability to billions of vectors.
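
To make the idea of comparing items by conceptual similarity concrete, here is a minimal Python sketch of exact (brute-force) nearest-neighbor search using cosine similarity. It is an illustration only, not InterSystems IRIS code: the three-dimensional vectors are hand-written stand-ins for the output of an embedding model, and the names cosine_similarity, top_k, and corpus are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# ASSUMPTION: toy 3-dimensional "embeddings" written by hand for illustration.
# A real embedding model would produce vectors with hundreds of dimensions.
corpus = {
    "heart attack symptoms": [0.9, 0.1, 0.0],
    "myocardial infarction signs": [0.8, 0.2, 0.1],
    "database index tuning": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of a query about cardiac warning signs.
query = [0.85, 0.15, 0.05]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(f"{score:.3f}  {doc_id}")
```

In this toy data the two cardiology entries rank above the database entry even though they share no keywords with each other, which is the behavior the answer describes. The sketch scans every vector; an ANN index of the kind mentioned above would trade that exhaustive scan for an approximate but faster lookup.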

Q4: How does vector search relate to natural language processing (NLP), text, and image analysis?

Thomas Dyar: As I alluded to at the outset, the meaning of vectors depends on the embedding models that generate them from the raw data, so vector search is integral to advancing our capabilities in natural language processing (NLP), text and image analysis, and omics data. By translating complex data into a format that machines can intuitively understand, we’re enhancing our ability to extract meaning, sentiment, and patterns from text and visual content. This is aimed not only at improving the accuracy of our analyses but also at opening the door to innovative applications in content discovery, automated summarization, and beyond.

Q5: What benefits does vector search technology bring to customers? Which of your customers do you believe is going to use Retrieval-Augmented Generation (RAG) to develop generative AI-based applications in healthcare?

Thomas Dyar: For our customers, vector search technology opens doors to transformative advantages. It lays the groundwork for the development of smarter, more adaptive applications capable of catering to user needs and contexts with agility. Particularly in the Healthcare and Life Science sector, this innovation empowers clients to craft applications harnessing Retrieval-Augmented Generation (RAG) for generative AI, potentially revolutionizing patient care, diagnostics, and treatment planning by granting access to pertinent medical knowledge like never before. Our application partner BioStrand uses this vector search functionality to help facilitate precise and contextually rich searches across extensive repositories of unstructured biological data. This encompasses textual data from patient notes and scientific papers, alongside a vast repository of sequential omics data such as DNA, RNA, and proteins. Clinicians and researchers can now unearth insights that were previously elusive, paving the way for deeper comprehension and more informed decision-making in healthcare and drug development. This expedites the drug discovery process, shortening the timeline from initial discovery to clinical trials and potentially accelerating the introduction of new treatments to market.

Q6: BioStrand is part of the InterSystems Innovation Program that helps start-ups build applications on InterSystems IRIS. What is this program useful for?

Dirk van Hyfte: The InterSystems Innovation Program is tailored to empower startups like BioStrand, equipping them with the technological foundation and support necessary to realize their innovative visions. This comprehensive program grants access to our advanced data platform, offers technical guidance, and provides scalable infrastructure, enabling startups to effectively tackle complex data challenges.

The partnership between BioStrand and InterSystems, leveraging the InterSystems IRIS data platform and its vector search capabilities, holds significant strategic importance for BioStrand’s objectives in drug discovery and development. Access to InterSystems IRIS enhances BioStrand’s capabilities and effectiveness in several key ways.

Firstly, InterSystems IRIS offers advanced data analysis capabilities, including vector search, allowing BioStrand to analyze complex biological datasets more effectively. This empowers BioStrand to extract valuable insights and identify patterns within vast repositories of unstructured biological data, aiding in deeper understanding of disease mechanisms, therapeutic target identification, and drug development optimization.

Moreover, by leveraging InterSystems IRIS and its vector search capabilities, BioStrand can streamline the drug discovery process. The precise and semantically rich searches facilitated by vector search help BioStrand identify promising candidate molecules and therapeutic targets more efficiently, accelerating the discovery phase and reducing resource investments.

Additionally, the partnership with InterSystems provides BioStrand access to a broader ecosystem of collaborators and stakeholders within the Healthcare and Life Sciences sectors. InterSystems IRIS serves as a platform for collaboration, allowing seamless integration of BioStrand’s technologies with existing infrastructure and workflows, fostering collaboration between researchers, developers, and healthcare professionals.

Furthermore, InterSystems IRIS offers scalability and performance capabilities crucial for handling large volumes of biological data. As BioStrand’s research and development efforts expand, access to a scalable and high-performance platform like InterSystems IRIS helps ensure effective management and analysis of increasing data volumes without compromising efficiency or security.

Through the partnership with InterSystems, BioStrand enhances its capabilities in drug discovery and development, aligning with its ambitious goals. Leveraging advanced data analysis tools and streamlined collaboration processes, BioStrand aims to expedite the drug discovery timeline and ensure scalability and performance. This collaboration underscores BioStrand’s commitment to innovation and its dedication to advancing Healthcare and Life Sciences.

Q7: What is the core business of BioStrand?

Dirk van Hyfte: BioStrand is leading the way in AI-driven drug discovery, utilizing advanced computational techniques to streamline the identification and enhancement of novel drug compounds. With a clear mission, BioStrand is dedicated to developing technology that enables the development of safer, more effective drugs that precisely target a wide range of medical conditions.

One of the key challenges faced by BioStrand’s pharmaceutical partners is the management of disparate and isolated data accumulated over time. The lack of data integration poses a significant obstacle in the landscape of biotherapeutic drug discovery and development.

In this realm, data exists in various forms. There is unstructured data, which includes textual information sourced from patient notes and scientific papers, as well as extensive sequential omics data such as DNA, RNA, and proteins. Structured data encompasses clinical lab records and their associated metadata. Additionally, there is contextual knowledge residing within the probabilistic models of Large Language Models (LLMs), which provide valuable insights into protein-protein interactions, solubility, and protein developability.

By addressing the complexities of integrating and leveraging this diverse data landscape, BioStrand is striving to redefine the process of drug discovery and development. The goal is to enable more efficient, targeted, and successful pharmaceutical interventions that effectively address critical medical needs.

Q8: What kind of technology do you use to identify and craft novel drug compounds, and reduce R&D timelines from development to commercialization?

Dirk van Hyfte: BioStrand employs a sophisticated combination of AI, machine learning, and bioinformatics to revolutionize the drug discovery process. This unique approach enables rapid compound screening, predictive modeling of drug efficacy, and optimization of drug design, aimed at streamlining R&D efforts.

At the heart of this technological innovation lies the Universal Foundation Model, which acts as a bridge, bringing together collective knowledge from diverse sources such as sequence information, structural data, and scientific literature. By harnessing the power of Large Language Models (LLMs) and neural networks, and utilizing HYFT as connector components, BioStrand can logically connect and seamlessly integrate these building blocks.

However, BioStrand’s pioneering approach doesn’t stop at integrating various LLMs and neural networks. It takes a step further by representing these models through embeddings. This means that the wealth of knowledge captured by these LLMs is transformed into vector representations, enhancing the efficiency and effectiveness of data analysis. Leveraging vector search capabilities on these embeddings enables unparalleled exploration and retrieval of relevant information within vast datasets. This technological synergy not only helps accelerate the pace of drug development but also opens up new horizons for generative AI in the life sciences field. With BioStrand’s advanced capabilities, researchers can unlock hidden patterns, identify promising drug candidates, and ultimately revolutionize the way we approach healthcare and pharmaceutical innovation.

Q9: Do you already use InterSystems IRIS data platform? If yes, how do you plan to use the new vector search addition?

Dirk van Hyfte: BioStrand’s Lense^ai applications continuously expand their capabilities by leveraging the InterSystems IRIS data platform. The introduction of vector search technology further enhances their potential, enabling the efficient mining of vast amounts of research articles and biomedical data. This not only helps improve the accuracy of their predictive models but is also designed to accelerate the identification of viable drug candidates.

In their pursuit of target discovery, BioStrand utilizes a combination of NLP (Natural Language Processing), Retrieval-Augmented Generation (RAG), and vector search. Their goal is to establish the causal connections between diseases and targets. By harnessing the power of the RAG system and vector search, BioStrand encodes text into vector space, allowing for conceptual similarity-based comparisons and retrieval of information. This approach surpasses simple spelling-based word matches and is designed to deliver more relevant results in a significantly shorter timeframe. Such capabilities prove particularly advantageous when dealing with extensive databases and large language models.
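
The retrieval half of a RAG pipeline can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not BioStrand’s or InterSystems’ implementation: the embed function below is a word-count stand-in (and therefore still keyword-based), whereas a real system would call an embedding model to obtain semantic vectors; the functions retrieve and build_prompt and the sample passages are hypothetical. The resulting prompt would then be passed to a language model, a step omitted here.

```python
import math
import re
from collections import Counter


def embed(text):
    """ASSUMPTION: stand-in embedding made of word counts.
    A real pipeline would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=2):
    """Return the k passages whose vectors are closest to the question vector."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that grounds the model's answer in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical mini knowledge base for illustration.
passages = [
    "Anti-drug antibodies can reduce the effectiveness of antibody therapy.",
    "Vector indexes speed up similarity search over embeddings.",
    "Clinical trials often underreport anti-drug antibodies.",
]

prompt = build_prompt("How do anti-drug antibodies affect therapy?", passages, k=2)
print(prompt)
```

The design point is the separation of concerns: retrieval narrows a large corpus to a handful of relevant passages, and only those passages are handed to the language model as context.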

Another essential application of BioStrand’s NLP techniques and vector search is in the identification and understanding of adverse effects, with a specific focus on anti-drug antibodies (ADAs). ADAs have the potential to hinder treatment effectiveness and lead to adverse effects, yet they are often underreported in clinical trials, necessitating standardized reporting. To address this challenge, BioStrand employs concept embeddings, which visually represent the semantic relationships between ADA-related concepts. This visualization aids in better understanding the patterns associated with ADAs. By leveraging NLP techniques, BioStrand strives to enhance the safety and efficacy of clinical antibody therapy, thereby advancing medical science for the benefit of patients.

Central to the success of BioStrand’s Foundation AI Model is its utilization of its patented HYFT technology, a sophisticated framework designed to identify and leverage universal fingerprint™ patterns across the biosphere. These fingerprints act as critical anchor points, encompassing detailed information layers that bridge sequence data to structural data, functional information, bibliographic insights, and beyond, serving as the great connector between disparate realms of knowledge.

BioStrand’s patented HYFT technology is integrated into the HIT Expansion pipeline, particularly in the realm of protein and antibody analysis. By utilizing protein Large Language Model (LLM) embeddings, BioStrand enables functional antibody searches at both the sequence and subsequence level. The embedding-level enrichment broadens the search dimensions by capturing hidden features beyond sequence and structure, including physicochemical, functional, and target-binding-relevant properties. The ability to search at the subsequence/HYFT level facilitates the detection of functional patterns, uncovering critical evolutionary significance. Moreover, the multi-layered design, involving a stacking of different protein LLMs, enhances the understanding of subtle variations in protein functionality and evolution.

Through these innovative approaches, BioStrand is at the forefront of harnessing the power of NLP, RAG, and vector search technologies. Their advancements in target discovery, adverse effect identification, and protein analysis not only revolutionize the drug discovery process but also pave the way for significant breakthroughs in healthcare and pharmaceutical innovation.

Qx: Anything else you wish to add?

Thomas Dyar and Dirk van Hyfte: As we stand on the brink of these exciting developments, it’s clear that technologies like vector search are not merely enhancements to our data platforms but fundamental drivers of innovation across industries. Making our customers successful through simplified development and deployment of AI applications motivates our continued commitment to advancing our technologies.

…………………………………….

Thomas Dyar, Product Manager, Machine Learning at InterSystems. Innovating at the intersections of data science, health IT and biomedical fields. I conceptualize, define and build solutions to mission-critical problems using machine learning and related techniques, drawing on a deep technical background in data science and scientific programming in academic and industry roles.

Dirk van Hyfte, CTO and Co-founder, BioStrand (IPA).

Visionary serial entrepreneur with the unique capacity to push technology into new frontiers. Dirk develops and applies ground-breaking multi-disciplinary concepts to shift paradigms both in the minds of the people he works with and in the organizations he consults. He pushes pioneering concepts beyond what is technically possible by fostering innovation and delivering extraordinary technological breakthroughs and solutions. Dirk has reinvented himself several times, merging his expertise as a trained psychiatrist with a Ph.D. in Medical Informatics and Artificial Intelligence with his solid business skills and enterprise software knowledge. His first big success was i.Know, providing unique text analytic technology for both structured and unstructured data. i.Know was acquired by InterSystems in 2010 to help further the AI capabilities of its data platform.

Resources

InterSystems Innovation Program

InterSystems expands the InterSystems IRIS data platform with Vector Search to support next-generation AI applications, March 26, 2024 | By Scott Gnau, Vice President of Data Platforms

IRIS data platform with Vector Search. Early Access Program
