On Generative AI and Databases. Interview with Adam Prout
“With GenAI also requiring massive amounts of training data, the need for greater storage capacity is crucial. Databases are designed to scale as data volumes grow, ensuring generative AI projects can handle larger datasets as they become available. This means databases can help support the growing demand for AI capabilities across the business world.”
Q1. How is Generative AI transforming the way we store, structure, and query data?
Adam Prout: The focus of generative AI is to create new data, such as text and images. At its core, GenAI is made of neural networks, a subset of machine learning that handles unstructured data like text, audio, images, and videos. These networks consist of connected layers that learn from training data and identify patterns to make new instances. But they are not creating copies of existing instances in the data set. Instead, these networks develop unique data points based on the training data. Increased computational power and the massive amounts of data produced in recent years have paved the way for generative AI.
Due to advancements in GenAI, many organizations are exploring the ways that the technology can increase efficiencies in their operations. For example, generative AI can help data analysts find hidden patterns in data sets, deriving actionable insights faster than a human could. In other instances, data augmentation helps organizations generate more data to train neural networks. Models like generative adversarial networks (GANs) can learn the distribution of original data, augment it, and create synthetic data to diversify training datasets for machine learning models. Likewise, content creation is a significant use case for generative AI as organizations can create reports, summaries, and other deliverables using proprietary data at a rapid speed.
As for querying data, we can ask questions of our data in natural language, which is more efficient than writing a SQL query or doing a full text search. As a result of GenAI, more data is being stored as vector embeddings in databases and looked up via Approximate Nearest Neighbor (ANN) vector searches.
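To make the vector lookup mentioned above concrete, here is a minimal sketch of the exact (brute-force) nearest-neighbor search that ANN indexes approximate. The document ids and the 3-dimensional embeddings are hypothetical values chosen for illustration, and the helper names `cosine_similarity` and `nearest` are assumptions, not part of any database API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, embeddings, k=2):
    """Score every stored vector against the query and return the k best ids."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings; real models emit hundreds or thousands of dimensions.
embeddings = {
    "invoice_faq": [0.9, 0.1, 0.0],
    "refund_policy": [0.8, 0.3, 0.1],
    "office_map": [0.0, 0.2, 0.9],
}

query = [0.9, 0.1, 0.05]
top = nearest(query, embeddings, k=2)
print(top)
```

Because this version compares the query with every stored vector, its cost grows linearly with the data set. ANN search gives up a little accuracy to avoid that full scan.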
There are many more ways that generative AI helps organizations better leverage their existing data while generating original instances. We’ll continue to discover how generative AI can transform the way we store, structure, and query data for years to come.
Q2. Generative AI relies on large amounts of data to generate human-like answers. Among the challenges faced by generative AI are Data Quality and Quantity. How can a database help here?
Adam Prout: Databases provide a structured framework for data storage, allowing organizations to implement routine data quality checks and validation rules to ensure models are only trained on high-quality information. Another advantage of using a database is the consistent maintenance of data through cleansing and enrichment tools. These processes remove inconsistencies, duplicates, and errors from the data, leading to better model training and improved generative AI outputs.
With GenAI also requiring massive amounts of training data, the need for greater storage capacity is crucial. Databases are designed to scale as data volumes grow, ensuring generative AI projects can handle larger datasets as they become available. This means databases can help support the growing demand for AI capabilities across the business world.
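As a rough illustration of the validation rules and duplicate removal described in this answer, the sketch below uses Python's built-in `sqlite3` module. The table name, columns, and allowed labels are hypothetical, and a production database would typically enforce richer rules than these.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Validation rules live in the schema: non-blank text, a fixed label set,
# and a uniqueness rule that rejects exact duplicates.
conn.execute(
    """
    CREATE TABLE training_examples (
        id INTEGER PRIMARY KEY,
        text TEXT NOT NULL CHECK (length(trim(text)) > 0),
        label TEXT NOT NULL CHECK (label IN ('positive', 'negative')),
        UNIQUE (text, label)
    )
    """
)

candidates = [
    ("Great product", "positive"),
    ("Great product", "positive"),   # exact duplicate
    ("   ", "negative"),             # blank text
    ("Arrived broken", "negative"),
    ("It was fine", "neutral"),      # label outside the allowed set
]

accepted, rejected = 0, 0
for text, label in candidates:
    try:
        conn.execute(
            "INSERT INTO training_examples (text, label) VALUES (?, ?)",
            (text, label),
        )
        accepted += 1
    except sqlite3.IntegrityError:
        # Raised when a CHECK or UNIQUE constraint is violated.
        rejected += 1

conn.commit()
rows = conn.execute("SELECT text, label FROM training_examples ORDER BY id").fetchall()
print(accepted, rejected, rows)
```

Putting the rules in the schema means every writer is held to the same checks, rather than relying on each data pipeline to clean its own input.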
Q3. Unlike traditional AI workloads that require additional specialized skills, new Generative AI workloads are available to a larger segment of the developer community. What does it mean in practice?
Adam Prout: This is great news for the practice. More software developers are able to leverage generative AI tools to increase efficiency and solve simple, clearly defined problems. And with a growing number of advanced AI code-generation tools on the market, developers can experiment with these technologies to create artificial data and test their code.
It’s no surprise that developers will play a key role in the GenAI revolution. Their expertise and skill sets are vital to improving the performance of AI and machine learning models. They’ll be able to successfully pivot to focusing on AI development as the need for AI/ML skills skyrockets.
Q4. Generative AI: How to Choose the Optimal Database?
Adam Prout: When selecting the right database for AI and machine learning models, organizations need to take into account several considerations:
- Speed of data processing: The ability to handle large volumes of data while processing information quickly can help organizations gain real-time insights to drive decision-making. This is especially true when working with streaming data or developing applications that require quick response times such as fraud detection or recommendation systems. A database built on a distributed architecture and in-memory data store can enable data processing at lightning-fast speed, helping organizations make fast and informed decisions.
- Vector search: The way vector searches handle high-dimensional data and provide advanced search and similarity capabilities helps organizations simplify data management processes. A vector search categorizes data based on multiple features, allowing organizations to store and search high-dimensional vectors efficiently. This capability helps organizations build more accurate and effective machine learning models as it filters comprehensive datasets into the systems.
- Scalability and integration: As AI requires more computing power and training data, selecting a database becomes even more important to help organizations build out their capabilities. Massive AI projects need a database that can handle complex queries at scale while helping extract and transform data to train AI/ML platforms. A highly scalable database can help companies meet increasing demands for AI-powered workloads. General purpose databases are flexible enough to handle a wide swath of data.
- Real-time analytics capabilities: Databases with built-in analytics capabilities can help organizations quickly identify trends and patterns in their data to make more informed and instantaneous decisions. The ability to run analytical queries paired with transactional ones in the same database system, known as hybrid transactional/analytical processing (HTAP), can eliminate the need for separate systems to complete tasks — simplifying the data architecture and reducing costs. This also offers greater flexibility as organizations look to adopt more AI capabilities into their operations.
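The last point, serving transactional writes and analytical queries from one system, can be sketched as follows. SQLite is an embedded single-node database and not an HTAP engine, so this only illustrates the access pattern. The `payments` table and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, account TEXT, amount REAL)")

def record_payment(account, amount):
    """Transactional write: committed on success, rolled back on error."""
    with conn:
        conn.execute(
            "INSERT INTO payments (account, amount) VALUES (?, ?)",
            (account, amount),
        )

def spend_by_account():
    """Analytical read over the same table, with no export step in between."""
    query = (
        "SELECT account, COUNT(*), SUM(amount) "
        "FROM payments GROUP BY account ORDER BY account"
    )
    return conn.execute(query).fetchall()

record_payment("acme", 120.0)
record_payment("acme", 80.0)
record_payment("globex", 45.5)

summary = spend_by_account()
print(summary)
```

The aggregate sees each payment as soon as it is committed, which is the property that removes the need for a separate analytics copy of the data.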
Q5. Are NoSQL databases better suited for Generative AI than SQL databases?
Adam Prout: NoSQL and SQL databases each have their own strengths and weaknesses, and which one works best for Generative AI depends on what your project needs. NoSQL databases inherently come with more flexibility when it comes to handling unstructured or semi-structured data, which can be beneficial for certain types of data used in Generative AI – think text, images, and sensor data. As for SQL databases, they provide powerful query capabilities, enabling IT leaders to perform complex data retrieval and analysis.
To put it simply, many GenAI projects use a combination of both types of databases, leveraging the strengths of each. When choosing which database to utilize, it’s critical to evaluate the needs and constraints of your project.
Q6. Some SQL databases do have some features that make them compatible with Generative AI, such as supporting JSON data and functions. Are they suited for Generative AI?
Adam Prout: SQL databases that support features like JSON can be well-suited for certain aspects of Generative AI, largely when dealing with flexible or semi-structured data formats. Some benefits these features provide are schema flexibility, data integration, complex querying, and scalability.
However, depending on the nature of one’s data – the volume and the complexity – a combination of SQL databases with NoSQL databases may also be a suitable solution. There isn’t a “one-size-fits-all” approach, so it’s important to evaluate the end goal of the project to make sure the choice aligns with its needs and constraints.
Q7. Are databases with vector support the bridge between LLMs and enterprise gen AI apps? Why?
Adam Prout: Databases that include vector support can most definitely play a crucial role when it comes to bridging the gap between LLMs and enterprise Generative AI applications for many reasons:
- Easier storage and retrieval of embeddings: LLMs, like ChatGPT, generate word embeddings or vector representations of text data. A database with vector support is designed not only to store these embeddings efficiently but also to retrieve them, making them easier to manage and query.
- Quick and accurate similarity searches: Vector searches reign supreme when it comes to performing similarity searches, and in the context of Generative AI, this is very valuable, as it enables applications to find similar documents or content quickly.
- Scalability: Scalability is crucial for enterprise applications that need to process vast amounts of data, especially as LLMs continue to produce substantial volumes of vector data. Vector search capabilities are purpose-built to efficiently manage large-scale vector data, making them a vital component in handling such demands.
- Real-time applications: Various enterprise Generative AI applications, like chatbots, sentiment analysis, and content generation, require real-time processing. Vector search enables real-time retrieval and analysis of vector data, giving applications the responsiveness they need.
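One way to picture how similarity search stays fast as vector data grows is a toy approximate index based on random hyperplanes, a form of locality-sensitive hashing. This is an illustrative technique chosen for the sketch, not a description of how SingleStore or any other product implements vector indexing. The class name, parameters, and random test vectors are all assumptions.

```python
import math
import random
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

class HyperplaneIndex:
    """Toy ANN index: vectors on the same side of every random hyperplane
    share a bucket, and only that bucket is scanned at query time."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def signature(self, vec):
        # One boolean per hyperplane: which side of the plane the vector is on.
        return tuple(dot(vec, plane) >= 0 for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets[self.signature(vec)].append((item_id, vec))

    def search(self, query, k=1):
        # Approximate step: near neighbors in other buckets are never examined.
        candidates = self.buckets.get(self.signature(query), [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]

# Hypothetical data: 200 random 8-dimensional vectors.
rng = random.Random(42)
vectors = {f"doc{i}": [rng.gauss(0.0, 1.0) for _ in range(8)] for i in range(200)}

index = HyperplaneIndex(dim=8)
for item_id, vec in vectors.items():
    index.add(item_id, vec)

# A rescaled copy of doc7 points in the same direction, so it lands in doc7's bucket.
query = [2.0 * x for x in vectors["doc7"]]
result = index.search(query, k=1)
scanned = len(index.buckets[index.signature(query)])
print(result, f"scanned {scanned} of {len(vectors)} vectors")
```

The trade-off is visible in `search`: only one bucket is ranked, so queries touch a fraction of the data, but a true neighbor that falls just across a hyperplane can be missed. Practical ANN indexes use more elaborate structures to keep that miss rate low.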
Q8. Will vector databases be the essential infrastructure in bringing about the societal and economic changes promised by AI?
Adam Prout: Firstly, I want to clarify my thoughts on the term “vector database.” To SingleStore, vector search is a capability of a database, not a new category of database. That being said, databases that support vector indexing are suited for storing and querying high-dimensional vectors, meaning that they are well-equipped for tasks related to machine learning, recommendation systems, natural language processing, and more.
So, will vector searches be the essential infrastructure in bringing about the societal and economic changes promised by AI? They most definitely play a significant role; however, it’s important to understand that they are just one piece of a very large puzzle that includes algorithms, hardware, ethical considerations, and much more. Whether or not they become “the essential infrastructure” depends on various factors, such as specific applications and use cases of AI. In addition, good results from GenAI prompts often require more than a vector search – often there is a need for more traditional filters on other attributes of data and the like.
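The point about combining vector search with traditional filters can be sketched with a small example. It assumes embeddings are stored as JSON text in a SQLite table and ranked in application code; a database with native vector support would do the ranking inside the query instead. The table, column names, and embedding values are hypothetical.

```python
import json
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id TEXT PRIMARY KEY, region TEXT, year INTEGER, embedding TEXT)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?)",
    [
        ("pricing_eu_2023", "EU", 2023, json.dumps([0.9, 0.1, 0.1])),
        ("pricing_us_2023", "US", 2023, json.dumps([0.95, 0.05, 0.1])),
        ("pricing_eu_2019", "EU", 2019, json.dumps([0.9, 0.1, 0.2])),
        ("hiring_eu_2023", "EU", 2023, json.dumps([0.1, 0.9, 0.2])),
    ],
)

def filtered_search(query, region, min_year, k=1):
    """Apply ordinary SQL filters first, then rank the survivors by similarity."""
    rows = conn.execute(
        "SELECT id, embedding FROM docs WHERE region = ? AND year >= ?",
        (region, min_year),
    )
    scored = [(cosine(query, json.loads(emb)), doc_id) for doc_id, emb in rows]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

query = [1.0, 0.0, 0.1]
hits = filtered_search(query, region="EU", min_year=2022, k=1)
print(hits)
```

On similarity alone, the US pricing document would score slightly higher than the EU one for this query. The region and year filters are what keep the answer relevant to the person asking.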
Q9. Who is already using Generative AI in the enterprise world?
Adam Prout: A recent report explored how companies are utilizing generative AI and found: 46% for content generation, 43% for developing analytics insights summaries, 32% for analytics insight generation, 32% for code development, and 27% for process documentation. On top of this, most companies are curious about AI but don’t use it as part of their everyday processes, with a majority (53%) saying they are “exploring” or “experimenting” with the tech.
All of this is to say that the use of generative AI in the enterprise landscape continues to evolve rapidly – whether that be organizations fully implementing the tech in their day-to-day operations, or employees utilizing it to complete specific tasks.
Q10. SingleStoreDB has evolved over the past 10 years from its early days as MemSQL (in-memory OLTP) to become a more general purpose distributed SQL Database. How do you manage AI and Generative AI?
Adam Prout: When we first founded MemSQL, people were saying SQL couldn’t scale, but we knew that wasn’t true. We knew we could build something scalable, but similar enough to a traditional, single-host database that customers wouldn’t have to learn a whole new database.
That took us to real-time analytics and then to a general purpose database. We expanded to a broad set of workloads and analytics, with performance similar to or even better than specialized systems. We’re giving customers the flexibility that comes with general purpose databases, as well.
As for AI, SingleStore has supported basic exact-match vector search capabilities for many years and we are adding improved vector indexes for ANN search for larger data sets. We believe vector searches combined with general purpose SQL databases capable of filtering, full text search, JSON and the like are crucial capabilities to unlock the most value from GenAI.
Adam Prout, CTO and Co-Founder, SingleStore
Adam Prout is the CTO at SingleStore and oversees product architecture and development. He joined SingleStore in 2011 as a co-founding engineer. Previously, Adam led engineering efforts on kernel development at Microsoft SQL Server. He holds bachelor’s degrees in Computer Science and Mathematics and a master’s degree in Mathematics from the University of Waterloo.