On The Future of Vector Databases. Interview with Charles Xie
“Open source is reshaping the technological landscape, and this holds particularly true for AI applications. As we progress into AI, we will witness the proliferation of open-source systems, from large language models to advanced AI algorithms and improved database systems.”
Q1. What is your definition of a Vector Database?
Charles Xie: A vector database is a cutting-edge data infrastructure designed to manage unstructured data. When we refer to unstructured data, we specifically mean content like images, videos, and natural language. Using deep learning algorithms, this data can be transformed into a novel form that encapsulates its semantic representation. These representations, commonly known as vector embeddings or vectors, signify the semantic essence of the data. Once these vector embeddings are generated, we store them within a vector database, empowering us to perform semantic queries on the data. This capability is potent because, unlike traditional keyword-based searches, it allows us to delve into the semantics of unstructured data, such as images, videos, and textual content, offering a more nuanced and contextually rich search experience.
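The pipeline described here (embed the data, store the vectors, then query by semantic similarity) can be sketched in a few lines of Python. The vectors below are hypothetical toy embeddings standing in for the output of a real deep-learning model:

```python
import math

# Hypothetical 4-dimensional embeddings standing in for model output;
# a real system would produce these with a deep-learning encoder.
corpus = {
    "a photo of a cat":       [0.9, 0.1, 0.0, 0.1],
    "a picture of a kitten":  [0.8, 0.2, 0.1, 0.1],
    "quarterly sales report": [0.0, 0.1, 0.9, 0.3],
}

def cosine_similarity(a, b):
    # Dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def semantic_search(query_vec, k=2):
    # Rank every stored vector by similarity to the query vector.
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

query = [0.85, 0.15, 0.05, 0.1]  # hypothetical embedding of a feline image query
print(semantic_search(query))
```

Unlike a keyword search, nothing here matches on the literal words: the two cat-related entries rank highest purely because their embeddings point in a similar direction to the query's.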
Q2. Currently, there are a multitude of vector databases on the market. Why do they come in so many versions?
Charles Xie: When examining vector database systems, clear disparities emerge. Some, like Chroma, adopt an embedded approach akin to SQLite, offering simplicity but lacking essential capabilities such as scalability. Conversely, systems like pgvector and Pinecone pursue a scale-up approach: they excel on a single node but are limited in how far they can scale out.
As a seasoned database engineer with over two decades of experience, I stress the complexity inherent in database systems. A systematic approach is vital when assessing these systems, encompassing components like storage layers, storage formats, data orchestration layers, query optimizers, and execution engines. Considering the rise of heterogeneous architectures, the latter must be adaptable across diverse hardware, from modern CPUs to GPUs.
From its inception, Milvus has embraced heterogeneous computing, running efficiently on a range of modern processors: Intel and AMD CPUs, ARM CPUs, and Nvidia GPUs. Support also extends to AI processors built for vector workloads. The challenge lies in tailoring algorithms and execution engines to each processor’s characteristics to ensure optimal performance. Scalability, inevitable as data grows, is a crucial consideration that Milvus addresses by supporting both scale-up and scale-out scenarios.
As the vector database gains prominence, its appeal to vendors stems from its potential to reshape data management. Therefore, transitioning to a vector database necessitates evaluating its criticality to business functions and anticipating data volume growth. Milvus stands out for both scenarios, offering consistent, optimal performance for mission-critical services and remarkable cost-effectiveness as data scales.
Q3. In your opinion when does it make sense to transition to a pure vector database? And when not?
Charles Xie: Now, let’s delve into the considerations for transitioning to a pure vector database. It’s crucial to clarify that a pure vector database isn’t merely a traditional database with a vector plugin; it’s a purposefully designed solution for handling vector embeddings.
There are two key factors to weigh. Firstly, assess whether vector computing and similarity search are critical to your business. For instance, if you’re constructing a RAG solution integral to millions of users daily and forming the core of your business, the performance of vector computing becomes paramount. In such a situation, opting for a pure vector database system is advisable. It ensures consistent, optimal performance that aligns with your SLA requirements, especially for mission-critical services where performance is non-negotiable. Choosing a vector database system guarantees a robust foundation, shielding you from unforeseen surprises in your regular database services.
The second crucial consideration is the inevitable increase in data volume over time. As your service runs for an extended period, the likelihood of accumulating larger datasets grows. With the continuous expansion of data, cost optimization becomes an inevitable concern. Most pure vector database systems on the market, including Milvus, deliver superior performance while requiring fewer resources, making them highly cost-effective.
As your data volume escalates, optimizing costs becomes a priority. It is common to see bills for vector database services grow substantially with an expanding dataset. In this context, Milvus stands out, showing over 100 times the cost-effectiveness of alternatives such as pgvector, OpenSearch, and other non-native vector database solutions. This advantage grows as your data scales, making Milvus a strategic choice for sustainable and efficient operations.
Q4. What is the initial feedback from users of Vector Databases?
Charles Xie: Reflecting on our beginnings six years ago, we focused primarily on catering to enterprise users. At the time, we engaged with numerous users involved in recommendation systems, e-commerce, and image recognition. Collaborations with traditional AI companies working on natural language processing, especially when dealing with substantial datasets, provided valuable insights.
The predominant feedback we received emphasized the enterprise sector’s specific needs. These users, being enterprises, possessed extensive datasets and a cadre of proficient developers. They emphasized deploying a highly available and performant vector database system in a production environment, a requirement often seen in large enterprises where AI was gaining traction.
It’s important to note that independent AI developers were not as prevalent during that period. AI, being predominantly in the hands of hyper-scalers and large enterprises, meant that the cost of developing AI algorithms and applications was considerably high. Around six years ago, hyper-scalers and large enterprises were the primary users of vector database systems, given their capacity to afford dedicated teams of AI developers and engineers. This context laid the foundation for our initial focus and direction.
In the last two years, we’ve witnessed a remarkable shift in the landscape of AI, marked by the breakthrough of modern AI, particularly the prominence of large language models. Notably, there has been a significant surge in independent AI developers, with the majority comprising teams of fewer than five individuals. This starkly contrasts with the scenario six years ago, when the AI development scene was dominated by large enterprises capable of assembling teams of tens of engineers, often including a cadre of computer science PhDs, to drive AI application development.
The transformation is striking—what was once the exclusive realm of well-funded enterprises can now be undertaken by small teams or even individual developers. This democratization of AI applications marks a fundamental shift in accessibility and opportunities within the AI space.
Q5. Will semantic search be performed in the future by ChatGPT instead of using vectors and a K-nearest neighbor search?
Charles Xie: Indeed, foundation models such as ChatGPT and vector databases share a common theoretical underpinning: the embedding-vector abstraction. Both leverage embedding vectors to encapsulate the semantic essence of the underlying unstructured data. This shared abstraction allows them to make sense of the information and perform queries effectively. Across large language models, AI models, and vector database systems, a profound connection exists, rooted in the same data abstraction: embedding vectors.
This connection extends further: both rely primarily on distance metrics such as Euclidean or cosine distance. Whether within ChatGPT or other large language models, using consistent metrics facilitates measuring similarities among vector embeddings.
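A minimal sketch of the two metrics mentioned, showing how they can disagree when vectors share a direction but differ in magnitude (the vectors are purely illustrative):

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: sensitive to vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 minus cosine similarity: sensitive only to direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

v1 = [1.0, 0.0]
v2 = [2.0, 0.0]  # same direction, twice the magnitude

print(euclidean_distance(v1, v2))  # 1.0: the vectors are apart in space
print(cosine_distance(v1, v2))     # 0.0: identical direction
```

The choice of metric matters: embeddings whose magnitudes carry no meaning are usually compared with cosine distance, while Euclidean distance also accounts for scale.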
Theoretically, then, large language models like ChatGPT and vector databases are deeply connected through their shared use of the embedding-vector abstraction. The workload division between them becomes apparent: both can perform semantic and k-nearest-neighbor searches. The noteworthy distinction lies in the cost efficiency of these operations.
While large language models and vector databases tackle the same tasks, the cost disparity is significant. Executing semantic search and k-nearest neighbor search in a vector database system proves to be approximately 100 times more cost-effective than carrying out these operations within a large language model. This substantial cost difference prompts many leading AI companies, including OpenAI, to advocate for using vector databases in AI applications for semantic search and k-nearest neighbor search due to their superior cost-effectiveness.
Q6. There seems to be a need from enterprises to have a unified data management system that can support different workloads and different applications. Is this doable in practice? If not, is there a risk of fragmentations of various database offerings?
Charles Xie: No, I don’t think so. To illustrate my point, consider the automobile industry. Can you envision a single vehicle serving as an SUV, a sedan, a truck, and a school bus all at once? That has not happened in the last 100 years of the automobile industry, and if anything, the industry will be even more diversified in the next 100.
It all started with the Model T; from this, we witnessed the birth of a great variety of automobiles commercialized for different purposes. On the road, we see lots of differences between SUVs, trucks, sports cars, and sedans, to name a few. A closer look at all these automobiles reveals that they are specialized and designed for specific situations.
For instance, SUVs and sedans are designed for family use, but their chassis and suspension systems are entirely different. SUVs typically have a higher chassis and a more advanced suspension system, allowing them to navigate obstacles more easily. On the other hand, sedans, designed for urban areas and high-speed driving on highways, have a lower chassis for a more comfortable driving experience. Each design serves a specific goal.
Looking at all these database systems, we see that many design goals contradict each other. It’s challenging, if not impossible, to optimize a design to meet all these diverse requirements. Therefore, the future of database systems lies in developing more purpose-built and specialized ones.
This trend has already been evident over the past 20 years. We started with traditional relational database systems, but over time we witnessed the emergence of big data solutions, the rise of NoSQL databases, and the development of time-series, graph, document, and now vector database systems.
On the other hand, certain vendors might have an opportunity to provide a unified interface or SDK to access various underlying database systems—from vector databases to traditional relational database systems. There could be a possibility of having a unified interface.
At Milvus, we are actively working on this concept. In the next stage, we aim to develop an SQL-like interface tailored for vector similarity search in vector databases. We aim to incorporate vector database functionality under the same interface as traditional SQL, providing a unified experience.
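As an illustration only, such an SQL-like vector query might look like the following; the syntax is hypothetical and does not represent an actual Milvus interface:

```sql
-- Hypothetical syntax: a similarity search expressed in SQL-like form,
-- combining a scalar predicate with a vector-distance ordering.
SELECT id, title
FROM documents
WHERE category = 'news'
ORDER BY cosine_distance(embedding, :query_vector)
LIMIT 10;
```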
Q7. What does the future hold for Vector databases?
Charles Xie: Indeed, we are poised to witness an expansion in the functionalities offered by vector database systems. In the past few years, these systems primarily focused on providing a single functionality: approximate nearest neighbor search (ANN search). However, the landscape is evolving, and in the next two years, we will see a broader array of functionalities.
Traditionally, vector databases supported similarity-based search. Now, they are extending their capabilities to include exact search or matching. You can analyze your data through two lenses: a similarity search for a broader understanding and an exact search for detailed insights. By combining these two approaches, users can fine-tune the balance between obtaining a high-level overview and delving into specific details.
Obtaining a sketch of the data might be sufficient for certain situations, and a semantic-based search works well. On the other hand, in situations where minute differences matter, users can zoom in on the data and scrutinize each entry for subtle features.
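Combining an exact (scalar) filter with a similarity ranking can be sketched as follows. This is a toy illustration, not the Milvus API; the records and embeddings are hypothetical, and a real vector database would push the filter down into the index:

```python
import math

# Toy records: (text, metadata, hypothetical 2-D embedding).
records = [
    ("red sneaker",  {"category": "shoes"},   [0.9, 0.1]),
    ("blue sneaker", {"category": "shoes"},   [0.8, 0.3]),
    ("red teapot",   {"category": "kitchen"}, [0.2, 0.9]),
]

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_search(query_vec, category, k=1):
    # Exact match first (scalar filter), then similarity ranking on the rest.
    candidates = [r for r in records if r[1]["category"] == category]
    candidates.sort(key=lambda r: cosine_sim(query_vec, r[2]), reverse=True)
    return [text for text, _, _ in candidates[:k]]

print(filtered_search([0.95, 0.05], category="shoes"))  # ['red sneaker']
```

The exact-match pass narrows the candidates deterministically; the similarity pass then ranks what remains, giving the fine-tunable balance between overview and detail described above.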
Vector databases will likely support additional vector computing workloads, such as vector clustering and classification. These functionalities are particularly relevant in applications like fraud detection and anomaly detection, where unsupervised learning techniques can be applied to cluster or classify vector embeddings, identifying common patterns.
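As a toy illustration of that idea, the sketch below scores a vector by its distance to the nearest cluster centroid; all values are hypothetical, and a real fraud- or anomaly-detection pipeline would cluster learned embeddings at much higher dimension:

```python
import math

def centroid(vectors):
    # Component-wise mean of a cluster of embeddings.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def anomaly_score(vec, clusters):
    # Distance to the nearest cluster centroid; larger means more anomalous.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(dist(vec, centroid(c)) for c in clusters)

# Two toy clusters of "normal" transaction embeddings (hypothetical values).
normal_clusters = [
    [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]],
    [[5.0, 5.0], [5.2, 4.8]],
]

print(anomaly_score([1.0, 1.0], normal_clusters))  # ~0: looks normal
print(anomaly_score([9.0, 0.0], normal_clusters))  # large: flag for review
```

An embedding far from every learned cluster of normal behavior is a natural candidate for review, which is the unsupervised pattern-finding use case described above.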
Q8. And how do you believe the market for open source Vector databases will evolve?
Charles Xie: Open source is reshaping the technological landscape, and this holds particularly true for AI applications. As we progress into AI, we will witness the proliferation of open-source systems, from large language models to advanced AI algorithms and improved database systems. The significance of open source extends beyond mere technological innovation; it exerts a profound impact on our world’s social and economic fabric. In the era of modern AI, with the dominance of large language models, open-source models and open-source vector databases are positioned to emerge victorious, shaping the future of technology and its societal implications.
Q9. In conclusion, are Vector databases transforming the general landscape, not just AI?
Charles Xie: Indeed, vector databases represent a revolutionary technology poised to redefine how humanity perceives and processes data. They are the key to unlocking the vast troves of unstructured data that constitute over 80% of the world’s data. The promise of vector database technology lies in its ability to unleash the hidden value within unstructured data, paving the way for transformative advancements in our understanding and utilization of information.
Charles Xie is the founder and CEO of Zilliz, focusing on building next-generation databases and search technologies for AI and LLM applications. At Zilliz, he also created Milvus, the world’s most popular open-source vector database for production-ready AI. He is currently a board member of the LF AI & Data Foundation and served as the board’s chairperson in 2020 and 2021. Charles previously worked at Oracle as a founding engineer of the Oracle 12c cloud database project. He holds a master’s degree in computer science from the University of Wisconsin-Madison.