An Introduction to Columnar Databases
By Charles Mahler
Choosing the right database for your application can be tough. The debate between SQL and NoSQL databases is ongoing, with the choice between a relational database and a NoSQL alternative like a key-value, document, or graph database depending on your data model and use case. Another type of database that developers are starting to use for their applications is the columnar database.
In this article, you will learn about the benefits and drawbacks of columnar databases and why they might be a viable option for your application.
Columnar databases explained
A columnar database organizes data in a column format on disk, rather than a traditional row-based format used by other databases. This means that all values for a particular column are stored together sequentially on disk, making it easy to retrieve data for a specific column.
This provides a number of benefits for OLAP-style analytics queries, such as taking the average of all values in a column. In a standard relational database with data organized in rows, this type of query is extremely inefficient: the engine has to scan every row just to read a single column.
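The difference between the two layouts can be sketched in a few lines of Python. This is a toy illustration with made-up sensor data, not a real storage engine; it just shows why a column-oriented layout lets an aggregate query touch only the data it needs.

```python
# Toy illustration (not a real storage engine) of row-oriented vs
# column-oriented layouts for the same table of sensor readings.

# Row-oriented layout: each record's fields are stored together.
rows = [
    {"sensor_id": 1, "temperature": 21.5, "humidity": 40},
    {"sensor_id": 2, "temperature": 22.1, "humidity": 38},
    {"sensor_id": 3, "temperature": 20.9, "humidity": 45},
]

# Column-oriented layout: each column's values are stored contiguously.
columns = {
    "sensor_id": [1, 2, 3],
    "temperature": [21.5, 22.1, 20.9],
    "humidity": [40, 38, 45],
}

# Row store: averaging one column still touches every field of every row.
avg_row = sum(r["temperature"] for r in rows) / len(rows)

# Column store: the query reads only the one contiguous array it needs.
avg_col = sum(columns["temperature"]) / len(columns["temperature"])

assert avg_row == avg_col
print(f"average temperature: {avg_col:.2f}")
```

In a real database the columnar layout also means fewer disk pages are read, since the temperature values sit next to each other on disk instead of being interleaved with every other field.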
Benefits of columnar databases
Better data compression
One major benefit of storing data in columns is that an optimal compression algorithm can be chosen for each column based on its data type. This results in higher compression ratios, which lowers data storage costs. Most columnar databases also let the user choose which compression algorithm to use for a column, trading faster performance against better compression depending on their specific application needs.
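A quick way to see why columnar layouts compress better is to serialize the same hypothetical table both ways and compare compressed sizes. This sketch uses Python's general-purpose zlib rather than the specialized per-column codecs real databases use, but the effect is the same: a low-cardinality column compresses far better when its values are stored together than when they are interleaved with other fields.

```python
import json
import zlib

# Hypothetical table: 1000 records with an id and a low-cardinality
# "status" column (mostly "ok", a few "error").
n = 1000
ids = list(range(n))
statuses = ["ok"] * 990 + ["error"] * 10

# Row-oriented serialization: status values are scattered between ids.
row_bytes = json.dumps([[i, s] for i, s in zip(ids, statuses)]).encode()

# Column-oriented serialization: each column is one contiguous run.
col_bytes = (json.dumps(ids) + json.dumps(statuses)).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))

print(f"row layout:    {row_compressed} bytes compressed")
print(f"column layout: {col_compressed} bytes compressed")
assert col_compressed < row_compressed
```

Real columnar engines go further than this, applying type-aware encodings such as run-length or dictionary encoding to each column individually.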
Improved analytics query performance
Columnar databases are significantly better for analytics queries due to a number of factors that are enabled by storing data in columns. Here are a few examples:
- Vector processing – Columnar databases can take better advantage of modern CPU features when analyzing data. One example is vectorized processing, which operates on multiple values at the same time.
- Parallel processing – Many columnar databases have the ability to spread workloads across multiple nodes to speed up queries.
- Pre-computed results – Columnar databases include a number of optimization techniques that often involve pre-computing results for common queries or creating materialized views of columns.
- Compression – Data isn’t just compressed on disk; it is also compressed in RAM, so more data points fit into memory than with a row-oriented database, which translates into faster results.
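The "pre-computed results" item above can be made concrete with a minimal sketch. This is a hypothetical illustration, not any specific database's API: the `MaterializedAverage` class is invented for the example, and real materialized views cover far more than a single aggregate. The idea is the same, though, because the aggregate is maintained incrementally, repeated queries read a cached result instead of rescanning the column.

```python
# Minimal sketch (hypothetical, not any specific database's API) of a
# materialized view for one aggregate: keep a running sum and count so
# each average query is O(1) instead of a full column scan.

class MaterializedAverage:
    """Maintains a column's running sum/count to answer avg() instantly."""

    def __init__(self, values):
        self._sum = sum(values)
        self._count = len(values)

    def append(self, value):
        # Incrementally maintain the view as new data arrives.
        self._sum += value
        self._count += 1

    def avg(self):
        return self._sum / self._count

temperatures = MaterializedAverage([21.5, 22.1, 20.9])
temperatures.append(23.3)
print(f"running average: {temperatures.avg():.2f}")
```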
Columnar databases are designed from the ground up for working with large amounts of data, and as a result most of them have built-in support for scaling both horizontally and vertically. Some systems can also scale storage and compute separately, which saves money because you can fit your hardware to your workload with more fine-grained control.
Columnar database use cases
The most obvious use case for columnar databases is analytics of any kind. Most columnar databases provide a number of different ways to access stored data, allowing both technical and non-technical users to get insights from the underlying data. Common use cases are forecasting and data visualization.
Most data warehouses use some form of columnar storage under the hood. Data warehouses store data from a number of different sources and allow that data to be queried efficiently in a single place.
Columnar databases are a good option for storing data from IoT devices due to the large volume of data that they generate and how that data is queried. A common architecture for IoT and Industry 4.0 solutions is to deploy a database at the edge which can be used for monitoring and then move the data to the cloud for long-term storage and analysis for things like predictive maintenance.
Observability and monitoring
Another area where columnar databases make sense is observability and performance monitoring of software applications. User expectations for apps keep rising, so software developers are collecting data at much higher frequencies to find performance issues faster and improve the reliability of their applications. Columnar databases are ideal for this because they can support all of the common types of observability data: logs, metrics, and traces. Used properly, a columnar database can simplify the architecture of a monitoring solution by eliminating the need for the multiple specialized databases often used to store these different types of data.
Examples of columnar databases
InfluxDB is an open source time series database that uses columnar storage optimized for time series workloads. It is built on open source projects like Apache Arrow and Apache Parquet, which make it easy to integrate with a broad ecosystem of tools.
Time series data tends to arrive in massive volumes, so it must be ingested, indexed, and made queryable quickly for real-time monitoring and alerting. This can be a challenge for columnar databases designed for more batch-style ingest and query workloads. In addition to real-time use cases, many users also want to store this data long term for historical analysis and forecasting without breaking the bank.
InfluxDB allows users to get the best of both worlds by managing the lifecycle of data and moving it between hot and cold storage to get fast performance on historical and real-time data while also keeping storage costs down by using cheaper object storage for data that isn’t being actively queried.
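The hot/cold lifecycle described above can be sketched in a few lines of Python. This is a simplified, hypothetical illustration, not InfluxDB's actual implementation: the `TieredStore` class, the `HOT_WINDOW_SECONDS` threshold, and the list-based tiers are all invented for the example, with plain lists standing in for fast local storage and cheap object storage.

```python
import time

# Hypothetical sketch of hot/cold data tiering (not InfluxDB's actual
# implementation): recent points stay in a fast "hot" tier, older points
# are demoted to a cheaper "cold" tier such as object storage.

HOT_WINDOW_SECONDS = 3600  # assumption: keep the last hour hot

class TieredStore:
    def __init__(self):
        self.hot = []   # (timestamp, value) pairs, fast to query
        self.cold = []  # older pairs; stand-in for object storage

    def write(self, timestamp, value):
        self.hot.append((timestamp, value))

    def compact(self, now):
        # Move anything older than the hot window to cold storage.
        cutoff = now - HOT_WINDOW_SECONDS
        self.cold.extend((t, v) for t, v in self.hot if t < cutoff)
        self.hot = [(t, v) for t, v in self.hot if t >= cutoff]

    def query(self, start, end):
        # Real-time queries mostly hit the hot tier; historical queries
        # transparently include the cold tier as well.
        return [v for t, v in self.hot + self.cold if start <= t <= end]

store = TieredStore()
now = time.time()
store.write(now - 7200, 20.1)  # two hours old -> will be demoted
store.write(now - 60, 21.7)    # one minute old -> stays hot
store.compact(now)
print(len(store.hot), len(store.cold))  # prints: 1 1
```

The key design point is that queries span both tiers transparently, so users get one query interface while the database keeps only actively queried data on expensive fast storage.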
ClickHouse is an open source columnar database designed for general analytics workloads. It is known for its query speed and can handle petabyte-scale data warehousing. ClickHouse also has built-in support for distributed query processing and provides a number of configuration options to modify how it indexes and stores data.
Redshift is a fully managed, cloud-based data warehousing service that uses a columnar storage model. It is designed for analytics workloads and can handle data at almost any scale. An advantage of Redshift is its large number of integrations with the surrounding AWS ecosystem.
Vertica is designed for analytics workloads and was one of the first commercially available columnar databases. Vertica can handle large-scale data warehousing and has built-in support for machine learning and advanced analytics.
Conclusion
The main takeaway here is that columnar databases offer significant advantages over traditional row-based databases for certain use cases, particularly applications that require efficient processing of large amounts of data. Columnar databases can provide faster query response times, more efficient use of storage space, and improved performance for analytical workloads.
On the other hand, there are also tradeoffs and weaknesses to consider, such as slower performance for single-row writes and updates. When evaluating whether to use a columnar database, you should consider your specific use case and requirements. With the growing popularity of columnar databases, there are a number of options available, each with its own set of features and capabilities.
Charles Mahler is a Technical Marketing Writer at InfluxData where he creates content to help educate users on the InfluxData and time series data ecosystem. Charles’ background includes working in digital marketing and full-stack software development.
Sponsored by InfluxData.