On Benchmarking Time Series Database Systems for Monitoring Applications. Q&A with Abdelouahab Khelifati and Mourad Khayati
Q1. What are Time series databases?
Time Series Database Systems (TSDBs) are specialized database systems designed to efficiently manage high-frequency time series data. Unlike relational database systems, TSDBs rely on the general assumption that queries do not target individual tuples but summarized entries in a time interval. They implement dedicated indexing and compression schemes to process large volumes of timestamped data with minimal latency. In addition to fast selection over time ranges, high ingestion rates, and optimized storage, TSDBs also provide support for complex queries and analytics on time series data.
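To make the idea of "summarized entries in a time interval" concrete, here is a minimal sketch in plain Python of the kind of time-bucketed aggregation a TSDB query typically expresses. The function name, the (timestamp, value) layout, and the sample readings are assumptions made for illustration; they are not taken from any particular system.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points, start, end, bucket_seconds):
    """Average (timestamp, value) points per fixed-width time bucket.

    Only points with start <= timestamp < end are considered. Returns a
    list of (bucket_start, mean_value) pairs sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        if start <= ts < end:
            # Align the timestamp to the start of its bucket.
            bucket_start = start + ((ts - start) // bucket_seconds) * bucket_seconds
            buckets[bucket_start].append(value)
    return [(b, mean(vs)) for b, vs in sorted(buckets.items())]


# Hypothetical readings: one sample every 10 seconds from a single sensor.
readings = [(t, float(t)) for t in range(0, 180, 10)]

# Summarize the first two minutes into one-minute averages.
summary = downsample_mean(readings, start=0, end=120, bucket_seconds=60)
print(summary)  # [(0, 25.0), (60, 85.0)]
```

The query returns two summary rows instead of the twelve raw tuples in the interval, which is the access pattern TSDBs are optimized for.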
Q2. Why are they essential for the large-scale deployment of many critical industrial applications?
Industrial applications often involve a multitude of sensors continuously emitting data at high frequency. Those applications require timely insights into historical patterns, proactive issue identification, and operational optimization, all of which are essential for maintaining reliability and efficiency at scale. The industrial landscape therefore relies heavily on TSDBs to deliver actionable insights, anticipate issues, and optimize operations across critical systems.
Q3. Specifically, what are monitoring applications?
Monitoring applications rely on the continuous tracking and analysis of time series data. This helps identify interesting patterns, such as anomalies or motifs, which are essential for informed decision-making. Examples of monitoring applications include tracking water quality, wind speed, climate change, or health status. In the healthcare field, for instance, monitoring applications utilize sensors for real-time tracking of patient vital signs and health metrics, facilitating remote monitoring and personalized care.
Q4. You have recently published TSM-Bench, a benchmark tailored for time series database systems used in monitoring applications. What is special about this benchmark?
TSM-Bench uniquely packages a three-pronged solution—time series data generation, query workload configuration, and performance evaluation—into an end-to-end benchmark. Despite its sophistication, TSM-Bench is very simple to use.
The first feature of TSM-Bench is a realistic data generator. Given that existing real-world time series are often limited in size and/or number, we provide a new scalable technique that augments the length and number of seed time series while preserving their key statistical properties.
Our data generator allows us to fairly evaluate the performance of TSDBs using large amounts of data without mischaracterizing the streaming requirements that accompany time series monitoring applications.
In addition to the generation component, TSM-Bench includes a suite of fundamental queries that serve as building blocks for more complex time series analytical operations. While existing works focus on static query configuration, our benchmark implements dynamic query variability, allowing for a more nuanced and comprehensive evaluation of TSDBs’ capabilities. It also combines both offline and online workloads. This acknowledges the dynamic nature of monitoring applications, where queries and data ingestion occur concurrently.
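As a rough illustration of what dynamic query variability can look like, the sketch below draws the number of sensors and the time range at random for each query instance instead of fixing them once. This is not TSM-Bench's actual workload generator: the function names, the parameter ranges, and the table and column names (`readings`, `ts`, `s1`, `s2`, ...) are all hypothetical.

```python
import random


def generate_query_instances(n, sensor_ids, t_min, t_max, max_sensors, max_range, seed=0):
    """Draw n query configurations with a varying sensor set and time range.

    Assumes max_sensors <= len(sensor_ids) and max_range <= t_max - t_min.
    """
    rng = random.Random(seed)  # seeded so that a workload can be replayed
    instances = []
    for _ in range(n):
        k = rng.randint(1, max_sensors)
        sensors = sorted(rng.sample(sensor_ids, k))
        length = rng.randint(1, max_range)
        start = rng.randint(t_min, t_max - length)
        instances.append({"sensors": sensors, "start": start, "end": start + length})
    return instances


def render_avg_query(instance, table="readings"):
    """Render one instance as SQL text against a hypothetical wide-table schema."""
    cols = ", ".join(f"AVG(s{i})" for i in instance["sensors"])
    where = f"ts >= {instance['start']} AND ts < {instance['end']}"
    return f"SELECT {cols} FROM {table} WHERE {where}"


for inst in generate_query_instances(3, list(range(10)), 0, 1000, 3, 100, seed=1):
    print(render_avg_query(inst))
```

Running the same query template over many such instances exposes how a system's latency changes with the number of sensors and the width of the time range, rather than reporting a single fixed configuration.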
Third, to the best of our knowledge, TSM-Bench is the first benchmark that extensively evaluates the impact of time series features on the storage of those systems. By evaluating how TSDBs handle feature compression within monitoring applications, our benchmark sheds light on their encoding schemes. This feature distinguishes TSM-Bench from existing benchmarks, which overlook the nuanced challenges of feature-based encoding of time series data.
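The effect of data features on storage can be seen with a small experiment. The sketch below uses delta encoding followed by zlib as a stand-in encoder; it is not the encoding scheme of any evaluated system, and the two synthetic series are made up for illustration. It only shows that the same encoder produces very different storage footprints depending on the shape of the data.

```python
import random
import struct
import zlib


def compressed_size(values):
    """Delta-encode integer values, compress with zlib, return size in bytes."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    raw = struct.pack(f"<{len(deltas)}i", *deltas)  # 4 bytes per delta
    return len(zlib.compress(raw))


rng = random.Random(42)
n = 10_000

# Series with long runs of repeated values: almost all deltas are zero.
repeating = [100 + (i // 500) for i in range(n)]

# Series with large random fluctuations: deltas are hard to predict.
noisy = [rng.randint(0, 1_000_000) for _ in range(n)]

size_repeating = compressed_size(repeating)
size_noisy = compressed_size(noisy)
print(f"raw: {4 * n} bytes, repeating: {size_repeating} bytes, noisy: {size_noisy} bytes")
```

A benchmark that only loads one kind of series would miss this gap, which is why varying the features of the input data matters when comparing storage efficiency.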
Lastly, our benchmark includes a diverse set of representative database systems, ensuring a comprehensive evaluation of their capabilities in managing time series data. Sequence-based systems, such as eXtremeDB, were not previously benchmarked and turned out to be among the best contenders.
Q5. How did you select the evaluated database systems?
The selection of the evaluated systems is based on the following criteria: (a) their popularity, (b) their performance in other benchmarks, (c) the support of the necessary operators to implement the queries, and (d) the results of pre-evaluation experiments. More specifically, we started with a popularity assessment, leveraging DB-Engines to identify the top 10 TSDBs. This initial selection was validated through OSS Insight, which considers repository stars, pull requests, and issues. Preliminary experiments were conducted on the pre-selected systems using a subset of queries with fixed parameters. Beyond popularity and performance, systems evaluated in other benchmarks and recognized for differences in underlying architecture were included. While some of those systems are not dedicated TSDBs, they are optimized for processing time series data.
Q6. Which databases did you consider for the benchmark?
Q7. What are the main results of this benchmark?
TSM-Bench highlights several key findings:
* We identified significant factors impacting TSDB performance, including the size of the selected data, the data output, and the specific operations performed by the query.
* The best-performing TSDB is contingent not only on the query type but also on the configuration of the query. Factors such as the number of sensors or the time range heavily impact query performance.
* Sequence-based systems, such as eXtremeDB, proved exceptionally efficient in queries with high selectivity involving small datasets.
* Systems implementing Single Instruction, Multiple Data (SIMD) architectures and sparse indexing, such as ClickHouse, demonstrated efficiency in handling large datasets, especially during queries with high selectivity. ClickHouse was also found to perform exceptionally well for bulk-loading and compression tasks.
* InfluxDB and MonetDB emerged as top performers for queries under high insertion rates, highlighting their efficiency in scenarios with rapid data influx.
* Partitioning-based systems such as TimescaleDB and QuestDB demonstrated very good trade-offs across workloads.
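For readers unfamiliar with sparse indexing, mentioned in the findings above, here is a generic sketch of the idea: rather than indexing every row, the index keeps one entry per block of sorted timestamps, and a range query scans only the blocks that can contain matches. This is a textbook illustration under simplified assumptions (in-memory list, fixed block size), not the implementation of ClickHouse or any other system; all names are hypothetical.

```python
from bisect import bisect_left


def build_sparse_index(timestamps, block_size):
    """Keep only the first timestamp of each block of sorted timestamps."""
    return [timestamps[i] for i in range(0, len(timestamps), block_size)]


def range_scan(timestamps, index, block_size, lo, hi):
    """Return positions of timestamps in [lo, hi), scanning only candidate blocks."""
    # Last block that starts strictly before lo; it may still contain lo.
    first_block = max(bisect_left(index, lo) - 1, 0)
    # First block that starts at or after hi; it cannot contain matches.
    last_block = bisect_left(index, hi)
    begin = first_block * block_size
    end = min(last_block * block_size, len(timestamps))
    return [i for i in range(begin, end) if lo <= timestamps[i] < hi]


# Hypothetical data: 100 sorted timestamps, indexed in blocks of 8 rows.
timestamps = list(range(0, 1000, 10))
index = build_sparse_index(timestamps, block_size=8)

positions = range_scan(timestamps, index, 8, lo=250, hi=300)
print(len(index), positions)  # 13 [25, 26, 27, 28, 29]
```

The index holds 13 entries for 100 rows, and the query touches a single block of 8 rows. The trade-off is a small scan inside each candidate block in exchange for an index that stays compact on large datasets.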
Q8. What are the main lessons learned in performing this benchmark?
Our benchmark imparted several lessons, which include:
* No silver bullet system. Each TSDB exhibits unique strengths and specializations. Understanding the diverse capabilities of different systems is crucial for practitioners seeking to align database choices with their specific workload requirements.
* Our results show that the choice of the appropriate system architecture heavily depends on two main factors: query selectivity and data size.
* This work showcased the pivotal role of data quality in benchmarking, emphasizing the necessity of using data with properties similar to the application data for meaningful insights.
* Existing TSDBs support simple query operations such as selections, aggregations, resampling, and simple statistical metrics. In general, their support for advanced time series analytical tasks, such as anomaly detection, missing values recovery, or forecasting, is still limited.
Q9. Can your benchmark be extended to cover other database systems?
Indeed, TSM-Bench allows the integration of new systems beyond the initially considered ones. We provide users with a comprehensive tutorial on extending the benchmark. Furthermore, TSM-Bench is versatile and can easily be extended with datasets, queries, and workloads.
Abdelouahab Khelifati is a PhD student at eXascale Infolab in Switzerland, working with Dr. Mourad Khayati and Prof. Philippe Cudré-Mauroux. During his PhD, he has worked on massive data series analysis, machine learning approaches for data augmentation, data compression, and benchmarking databases. He has published three articles in top conferences, including his recent benchmark paper, which was published at VLDB. Before that, he obtained two Master's degrees in computer science, at the Higher School of Computer Science (ESI) in Algeria and at University Paul Sabatier in France. Before starting his PhD, he also worked as a research engineer at the computer science lab IRIT in France.
Mourad Khayati is a Senior Researcher and a Lecturer with the eXascale Infolab group and the Advanced Software Engineering group, respectively, at the Department of Computer Science of the University of Fribourg, Switzerland. He obtained his PhD from the University of Zurich, Switzerland, under the supervision of Prof. Michael Böhlen. His research interests include Time Series analytics and temporal data cleaning/repair with a special focus on recovery of missing values. He is the recipient of the VLDB 2020 Best Experiments and Analysis Paper award.
TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications, Abdelouahab Khelifati, Mourad Khayati, Anton Dignös, Djellel Difallah, Philippe Cudré-Mauroux. VLDB'23, Vancouver, Canada, August 30, 2023