On Vector Databases. Q&A with Charles Xie.
“Building a vector database at scale is a complex and challenging task. Data volume is one of the most critical factors since the fundamental user requirement is to query at least a billion vectors in milliseconds. And the more data you have, the more complex it becomes to design the underlying systems.”
Q1. What are the challenges of working with unstructured data?
The total amount of digital data generated worldwide is increasing rapidly. Approximately 80% (and growing) of this newly generated data is unstructured data that does not conform to a table- or object-based model. Examples of unstructured data include text, images, videos, protein structures, and geospatial information. The sheer volume and complexity are the biggest challenges of unstructured data. When data has a predictable structure, storing and querying it efficiently can be easy; unstructured data offers no such shortcut.
Q2. What is a vector database? How is it different from a relational database or an object database or a NoSQL database?
Relational databases track inventory, transactions, and large amounts of customer data. This mission-critical data is stored in tables, and the relationships between those tables are essential: they make it possible to fetch data from multiple tables simultaneously and help ensure that the data stays consistent and up to date.
Vector databases, on the other hand, are specialized databases for storing unstructured data that a machine-learning model has transformed into numeric representations called embeddings. They are crucial when building semantic representations in the AI era. Unlike relational databases, there are no direct relationships between data collections in a vector database. Instead, vector databases execute similarity searches using Approximate Nearest Neighbor (ANN) search, a form of proximity search that finds the points in a given set that are closest, or approximately closest, to a given query point.
Furthermore, embeddings make it possible to query and compare complex, unstructured data, like a catalog of digital images with several features, including image size, colors, patterns, location, etc. Converting this data into embeddings makes it possible to store the resulting high-dimensional vectors in a form optimized for searching by similarity, shared geography, and more.
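To make the idea of similarity search concrete, here is a minimal Python sketch of an exact, brute-force nearest-neighbor search over a few made-up 3-dimensional vectors. It is only an illustration and not how Milvus works internally: an ANN index trades a little accuracy to avoid scanning every vector. The catalog entries, vector values, and function names are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, vectors, k=1):
    """Return the k (key, distance) pairs closest to query, by full scan."""
    scored = [(key, euclidean(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Hypothetical embeddings; a real model would emit hundreds of dimensions.
catalog = {
    "red_shoe":  [0.9, 0.1, 0.0],
    "pink_shoe": [0.8, 0.2, 0.1],
    "blue_hat":  [0.0, 0.1, 0.9],
}

top2 = nearest_neighbors([1.0, 0.0, 0.0], catalog, k=2)
print([key for key, _ in top2])  # ['red_shoe', 'pink_shoe']
```

A full scan like this costs one distance computation per stored vector for every query, which is why it stops being practical long before a billion vectors.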
Q3. Managing massive embedding vectors generated by deep neural networks and other machine learning models is crucial. Why did you choose to build a vector database to store such big data?
More and more applications are taking advantage of the large amounts of unstructured data to differentiate themselves. For example, consider a scenario where hundreds of millions of different products are available. To make this data easily digestible by the user, the developer may decide to build a product recommender in their application. Embeddings are generated from product photos and videos, allowing the searches to yield a nearest-neighbor result that aligns with users’ interests. We chose to build a vector database because it was essential to have a solution built and optimized for the volume and scale of these AI application datasets. When comparing billions of embeddings, performance and speed are critical, especially as people have become used to instant access to recommendations or information.
Q4. How are these vector embeddings typically generated?
Embeddings are generated by applying a machine-learning model to unstructured data. The model processes the source data and converts all the various properties into vectors to be compared and queried efficiently. Choosing the best machine learning model for the use case is essential. If the model isn’t a good fit for the data type, it can negatively impact the usefulness of the dataset. For example, you shouldn’t use a model optimized for vectorizing text on a library of digital images, just like you wouldn’t translate a book into French for an audience that only reads Finnish.
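As a rough illustration of what "converting data into vectors" means, the sketch below uses a hashed bag-of-words as a stand-in for a trained model. This is an assumption made purely for the example: it captures no semantics, and a real embedding would come from a machine-learning model suited to the data type. The function names and the 16-dimension size are hypothetical.

```python
import math
import zlib

DIM = 16  # assumed toy dimensionality; real models use hundreds or more

def toy_embed(text, dim=DIM):
    """Hash each word into one of dim buckets and L2-normalize the counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = zlib.crc32(word.encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))

a = toy_embed("red running shoes")
b = toy_embed("shoes red running")  # same words, different order

print(len(a))   # 16: every input maps to a vector of the same length
print(a == b)   # True: this toy scheme ignores word order
```

The point of the sketch is only the shape of the pipeline: variable-length input goes in, a fixed-length numeric vector comes out, and vectors can then be compared with a similarity function.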
Q5. Can you store embeddings in a traditional relational or object database or a NoSQL database?
You could, but only on a small scale with few dimensions. A large dataset with many dimensions would quickly run into performance issues as you tried to scale it up. So we built a vector database explicitly optimized for massive volumes of vectorized data. Imagine datasets with billions of embeddings. Storing and querying vector data efficiently is what vector databases are designed for.
Q6. What is Milvus? And what is it useful for?
Milvus is a highly flexible, reliable, cloud-native, open-source vector database created by Zilliz in 2018. In January 2020, Zilliz donated Milvus to the LF AI & Data Foundation as an incubation project. Initially designed as a single-instance, local database, Milvus quickly matured as a project and community. It became fully cloud-native in June 2021, the same month it became an LF AI & Data graduated-stage project. Zilliz has been and will remain a key backer of Milvus, which now has 200+ contributors, 1000+ enterprise users, and 2000+ community members.
As a cloud-native vector database, Milvus can store, index, and manage billions of embedding vectors generated by deep neural networks and other machine learning (ML) models. This level of scale is vital for handling the volumes of unstructured data being generated, and it helps organizations analyze and act on that data to provide better service, reduce fraud, avoid downtime, and make decisions faster.
Q7. In your opinion, what are the main challenges with implementing a vector database service at scale?
Building a vector database at scale is a complex and challenging task. Data volume is one of the most critical factors since the fundamental user requirement is to query at least a billion vectors in milliseconds. And the more data you have, the more complex it becomes to design the underlying systems. This onerous requirement means that the entire database architecture must be carefully considered and planned out in advance, starting with scalability and performance in mind, from the storage to the query layer.
With this in mind, we focused on building a new storage format that stores vectors efficiently and captures locality, that is, the tendency of similar vectors to occur close to each other in the vector space. Of course, at these volumes, we can’t store all the data in memory; therefore, building a hierarchical caching system is necessary. Our design stores vectors in a way that captures locality, allowing us to reduce the number of cache misses and improve the overall performance of the database.
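A quick back-of-the-envelope calculation shows why holding everything in memory is unrealistic at this scale. The sketch below assumes one billion embeddings of 768 dimensions stored as 4-byte floats; the dimension and value width are illustrative assumptions, not figures from the interview, and index overhead is ignored.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed to hold the raw vectors alone, with no index overhead."""
    return num_vectors * dim * bytes_per_value

# Assumed workload: one billion 768-dimensional float32 embeddings.
total = raw_vector_bytes(1_000_000_000, 768)
print(f"{total / 1e12:.2f} TB")  # 3.07 TB
```

Roughly three terabytes for the raw vectors alone is well beyond the memory of a typical single server, which motivates tiered storage.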
A hierarchical caching system overcomes the limitations of memory capacity. This caching system stores the most frequently accessed vectors in a high-speed cache and gradually moves less frequently accessed vectors to slower storage tiers. A hierarchical caching system ensures that the most relevant data is always available in memory while still being able to store a large number of vectors.
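The tiering idea can be sketched with a small in-memory "hot" tier in front of a larger "cold" tier. This is a simplified illustration, not Milvus's actual cache: it uses least-recently-used (LRU) eviction as a simple proxy for access frequency, it has only two tiers, and the class and attribute names are hypothetical.

```python
from collections import OrderedDict

class TieredVectorStore:
    """Small LRU 'hot' tier in front of a larger 'cold' tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # stands in for memory
        self.cold = {}             # stands in for SSD or object storage
        self.hits = 0
        self.misses = 0

    def put(self, key, vector):
        # New data lands in the cold tier; it is promoted on first read.
        self.cold[key] = vector

    def get(self, key):
        if key in self.hot:
            self.hits += 1
            self.hot.move_to_end(key)      # mark as most recently used
            return self.hot[key]
        self.misses += 1
        vector = self.cold[key]            # slow path; KeyError if absent
        self.hot[key] = vector
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)   # evict least recently used
        return vector

store = TieredVectorStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, [0.0, 1.0])

for name in ("a", "b", "a", "c", "a", "b"):
    store.get(name)
print(store.hits, store.misses, list(store.hot))  # 2 4 ['a', 'b']
```

In this access pattern the repeatedly requested vector "a" stays in the hot tier, while "b" and "c" take turns being evicted. A storage layout that keeps similar vectors together would tend to turn more of those misses into hits.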
Another consideration is the trade-off between storage space and query performance. Developers use vector databases for a variety of tasks, such as similarity search, clustering, and classification. However, each task requires different operations to be performed on the vectors. For example, similarity search involves calculating the distance between vectors, while clustering requires grouping similar vectors. By designing the storage format to optimize for the most common use cases, we can improve the efficiency of the database.
Finally, we also considered the choice of hardware, the design of the query interface, and the choice of programming language and libraries.
Modern processors, whether CPU or GPU, each offer multiple forms of parallelism. Numerous similarity metrics need to be considered, such as the Euclidean distance, Jaccard distance, Tanimoto distance, etc. To handle these similarity metrics efficiently, we built a distance calculation algorithm with a single execution engine.
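For reference, the three metrics named above can be written out in a few lines of Python. This is a sketch under stated assumptions: the Tanimoto form shown is one common definition for real-valued vectors (1 minus the Tanimoto similarity) and may differ from the variant a given database uses, and the dispatch table is only a loose, hypothetical analogy for routing every metric through one code path.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

def tanimoto(a, b):
    """1 - Tanimoto similarity, with similarity = a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        return 0.0  # both vectors are all zeros
    return 1.0 - dot / denom

# Hypothetical dispatch table: one entry point, several metrics.
METRICS = {"euclidean": euclidean, "jaccard": jaccard, "tanimoto": tanimoto}

def distance(metric, a, b):
    return METRICS[metric](a, b)

print(distance("euclidean", [0, 0], [3, 4]))       # 5.0
print(distance("jaccard", {1, 2, 3}, {2, 3, 4}))   # 0.5
print(round(distance("tanimoto", [1, 0, 1], [1, 1, 0]), 4))  # 0.6667
```

For 0/1 vectors, this Tanimoto distance equals the Jaccard distance of the corresponding sets of non-zero positions, which is why the two are often mentioned together.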
Parallel computation can be performed on this data to speed up the process and make it more efficient. More specifically, parallel computation can be used to calculate the distance between vectors, which is a computationally expensive step in building a vector database.
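The scatter-and-gather shape of a parallel distance computation can be sketched as follows: split the vectors into shards, find the best candidates in each shard concurrently, then merge the partial results. This is an illustrative assumption about structure, not the engine described above. Note also that in standard CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock; production engines rely on SIMD instructions, multiple cores, or GPUs instead. All names here are hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_shard(shard, query, k):
    """Exact top-k within one shard, as a list of (distance, id) pairs."""
    return heapq.nsmallest(
        k, ((euclidean(query, vec), vid) for vid, vec in shard)
    )

def parallel_search(vectors, query, k=2, workers=4):
    """Split vectors into shards, search each concurrently, merge results."""
    items = list(vectors.items())
    shards = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda s: search_shard(s, query, k), shards)
        merged = heapq.nsmallest(k, (hit for part in partials for hit in part))
    return [vid for _, vid in merged]

data = {i: [float(i), 0.0] for i in range(10)}  # hypothetical 2-d vectors
print(parallel_search(data, [3.2, 0.0], k=3))   # [3, 4, 2]
```

Because each shard returns its own top k, the merge step only has to compare at most k candidates per shard, and the final answer is the same as a single full scan would give.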
By considering all of these factors, we can build a vector database that is fast, efficient, and scalable.
Q8. What new use cases does the rise of vector databases open up?
- Semantic text search: Processing and querying text across multiple vectors like intent, location, and previous search history can provide the context necessary for more accurate and nuanced results.
- Targeted advertising: Vector databases can be used in targeted advertising to improve the relevance and effectiveness of ad targeting. In this context, the database can store and index large amounts of data related to user behavior, demographics, and interests as high-dimensional vectors. Ads are then mapped to the same space as the users, making targeted advertising as simple as performing a query in Milvus.
- E-commerce: Vector databases such as Milvus can power product recommendation engines by combining multiple sources of unstructured data such as search history and past purchases.
- UGC recommendation: User-generated content comes in various formats, ranging from simple text (blog posts, news articles, etc.) to short- and long-form videos (TikTok, YouTube, etc.). Each piece of content is stored as a single vector representation in a vector database. This vector representation makes recommending new content as easy as querying over content users have liked or engaged with previously.
- Risk-control and anti-fraud: Anti-fraud systems can also use vector representations to encode similarities between actions and other data points. For example, an anti-fraud system can compare the vectors representing different transactions or behaviors to identify similarities that may indicate a higher risk of fraud or other illicit activities.
- New drug discovery: In drug discovery, vector representations of compounds include the overall structure and biological properties. A vector database can store and index this data as high-dimensional vectors, enabling new drug discovery simply by querying.
Q9. Why did you create Milvus (OSS)/Zilliz Cloud (commercial) and how does it tackle the challenges of working with unstructured data?
Milvus is an open-source vector database we created to make working with unstructured data as efficient as possible. Zilliz Cloud, on the other hand, is a commercial product built on top of Milvus that offers advanced features and capabilities.
Building open-source software can improve the software itself. By making the source code available to others, you allow others to find and fix bugs, suggest improvements, and add new features. As a result, making your software open-source can lead to a higher quality product overall and a better user experience.
More importantly, building open-source software helps to create a sense of community and collaboration. By contributing to open source, you become part of a larger group of developers passionate about creating high-quality software.
Zilliz innovates in part thanks to the adoption of an open-source software development model. We listen closely to the open-source community that makes our work possible, continuously iterating to maximize the utility of everything we build. Because unstructured data processing and analysis is an emerging field, remaining agile has been paramount to our relevance and success.
Innovation also happens through transparency, a core value in our company culture. The best ideas are often found where you least expect them, and transparency encourages Zillizers always to speak their mind, which helps us prioritize things that make an impact and helps us support and develop our employees.
Q10. How can people get started with vectorizing and working with unstructured data?
Users who want to try working with a purpose-built vector database can try Zilliz Cloud for free (all new signups get $400 in credits). Zilliz Cloud includes the blazing-fast speed of open-source Milvus with additional security features and elastic scaling for growing workloads, all without the hassle of managing the hosting.
Charles Xie | Founder, CEO, Zilliz
Charles Xie is an expert in databases and AI with more than 20 years of experience. He is the founder and CEO of Zilliz, an open-source software company developing unstructured database systems for AI applications. He also serves as a board member of LF AI & Data, an umbrella foundation of the Linux Foundation supporting open-source innovations in artificial intelligence, big data, and analytics.
Before Zilliz, Charles worked for many years at Oracle US headquarters, where he developed Oracle’s relational database systems and then became a founding member of the Oracle 12c cloud database project. The project proved to be a huge business success and has generated over $10 billion in revenue to date. Charles holds a master’s degree in computer science from the University of Wisconsin-Madison and a bachelor’s degree from Huazhong University of Science and Technology.
Sponsored by Zilliz.