On Time Series Databases. Interview with Ryan Betts
I have interviewed Ryan Betts, VP of Engineering at InfluxData. We talked about time series databases, InfluxDB and the InfluxData stack. RVZ
“Time series databases have key architectural design properties that make them very different from other databases. These include time-stamped data storage and compression, data lifecycle management, data summarization, ability to handle large time-series-dependent scans of many records, and time-series-aware queries.” –Ryan Betts
Q1. What is time series data?
Ryan Betts: Time series data consists of measurements or events that are captured and analyzed, often in real time, to operate a service within an SLO, detect anomalies, or visualize changes and trends. Common time series applications include server metrics, application performance monitoring, network monitoring, and sensor data analytics and control loops. Metrics, events, traces and logs are examples of time series data.
Q2. What are the hard database requirements for time series applications?
Ryan Betts: Managing time series data requires high-performance ingest (time series data is often high-velocity, high-volume), real-time analytics for alerting and alarming, and the ability to perform historical analytics against the data that’s been collected. Additionally, many time series applications apply a lifecycle policy to the data collected — perhaps downsampling or aggregating raw data for historical use.
With time series, it’s common to perform analytics queries over a substantial amount of data. Time series queries commonly include columnar scans, grouped and windowed aggregates, and lag calculations. This kind of workload is difficult to optimize in a distributed key value store. InfluxDB uses columnar database techniques to optimize for exactly these use cases, giving sub-second query times over swathes of data and supporting a rich analytics vocabulary.
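As a rough illustration of the windowed aggregates and lag calculations mentioned above, here is a minimal Python sketch. It is not how InfluxDB executes queries; the function names and the sample readings are hypothetical, and the data is assumed to be a small in-memory list of (timestamp, value) pairs.

```python
from collections import defaultdict
from statistics import mean

def window_mean(points, width):
    """Group (timestamp, value) points into fixed-width windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        buckets[ts - ts % width].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

def lag_difference(series):
    """Return the change between each value and the one before it (a lag-1 calculation)."""
    return [(t1, v1 - v0) for (_, v0), (t1, v1) in zip(series, series[1:])]

# Hypothetical CPU readings: (unix seconds, percent busy).
raw = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 40.0), (80, 60.0), (120, 50.0)]
per_minute = window_mean(raw, 60)     # one averaged point per 60-second window
deltas = lag_difference(per_minute)   # change from one window to the next
```

A columnar engine does the same kind of work over far larger ranges, which is why scanning a single column efficiently matters so much for this workload.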
While time series data is typically structured, it often has dynamic properties that aren’t well-suited to strict schema enforcement. Time series databases often specify the structure of data but allow schema-on-write. Another way of saying this is that time series databases often support arbitrary dimension data to decorate the contents of the fact table. This allows developers to create new instrumentation or collect metrics from new sources without performing frequent schema migrations. Document databases and column-family stores similarly allow flexible schema in their own contexts. The motivation with time series is similar — optimizing for developer productivity.
In addition to high-performance ingest, non-trivial analytics queries, and flexible schema, TSDBs also need to bridge real-time analytics to real-time action. There’s little point doing real-time monitoring if you can’t also automate real-time responses. So time series databases, like other real-time analytics systems, need to provide the analytics function and the ability to tie into real-time operations. That means integrating automated alerting, alarming, and API invocations with the query analytics performed for monitoring.
Q3. How do you manage the massive volumes and countless sources of time-stamped data produced by sensors, applications and infrastructures?
Ryan Betts: The InfluxData stack is optimized for both regular (metrics often gathered from software or hardware sensors) and irregular time series data (events driven either by users or external events), which is a significant differentiator from other solutions like Graphite, RRD, OpenTSDB, or Prometheus. Many services and time series databases support only the regular time series metrics use case.
InfluxDB lets users collect from multiple and diverse sources, store, query, process and visualize raw high-precision data in addition to the aggregated and downsampled data. This makes InfluxDB a viable choice for applications in science and sensors that require storing raw data.
At the storage level, InfluxDB organizes data into a columnar format and applies various compression algorithms, typically reducing storage to a fraction of the raw uncompressed size. Time series applications are “append-mostly”. The majority of arriving data is appended. Late arriving data and deletes occur with some frequency — but primarily writes result in appending to the fact table. The database uses a log structured merge tree architecture to meet these requirements. Deletes are recorded first as tombstones and are later removed through LSM compaction.
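The append-mostly pattern with tombstoned deletes can be sketched in a few lines of Python. This is a toy model for illustration only, not InfluxDB's storage engine: the class and method names are hypothetical, everything lives in memory, and overwrites of an existing point are ignored.

```python
class AppendMostlyStore:
    """Toy append-mostly store: writes append, deletes leave tombstones."""

    def __init__(self):
        self.log = []            # append-only list of (series, timestamp, value)
        self.tombstones = set()  # (series, timestamp) pairs marked as deleted

    def write(self, series, timestamp, value):
        # Every write is an append, even when the point arrives late.
        self.log.append((series, timestamp, value))

    def delete(self, series, timestamp):
        # Record the delete; the stored point is not touched yet.
        self.tombstones.add((series, timestamp))

    def read(self, series):
        # Reads filter out tombstoned points and return the rest in time order.
        live = [(ts, v) for s, ts, v in self.log
                if s == series and (s, ts) not in self.tombstones]
        return sorted(live)

    def compact(self):
        # Compaction rewrites the data without deleted points, sorts it,
        # and then drops the tombstones that are no longer needed.
        self.log = sorted(p for p in self.log if (p[0], p[1]) not in self.tombstones)
        self.tombstones.clear()

store = AppendMostlyStore()
store.write("cpu", 3, 0.7)
store.write("cpu", 1, 0.5)   # late-arriving point, still appended
store.write("cpu", 2, 0.9)
store.delete("cpu", 2)
visible = store.read("cpu")  # the deleted point is hidden before compaction
store.compact()              # the deleted point is physically removed here
```

The point of the design is that deletes stay as cheap as writes; the cost of actually removing data is deferred to compaction.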
Q4. Can you give us some time series examples?
Ryan Betts: Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. Time-stamped data is data collected at different points in time.
These data points typically consist of successive measurements made from the same source over a time interval and are used to track change over time.
Weather records, step counts from fitness trackers, and heart rate readings are all time series data. If you look at the stock exchange, a time series tracks the movement of data points, such as a security’s price over a specified period of time with data points recorded at regular intervals.
InfluxDB has a line protocol for sending time series data which takes the following form:
<measurement name>,<tag set> <field set> <timestamp>
The measurement name is a string, the tag set is a collection of key/value pairs where all values are strings, and the field set is a collection of key/value pairs where the values can be int64, float64, bool, or string. The measurement name and tag sets are kept in an inverted index which makes lookups for specific series very fast.
For example, if we have CPU metrics:
cpu,host=serverA,region=uswest idle=23,user=42,system=12 1549063516
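To make the format concrete, the following Python sketch builds a line protocol string from a measurement, tags, and fields. The helper names (`format_field`, `escape_key`, `to_line`) are hypothetical and this is not an official client. It assumes the usual line protocol conventions: integers take an `i` suffix, strings are double-quoted, and commas, equals signs, and spaces in keys are backslash-escaped. A bare number such as `idle=23` in the example above is read as a float, so the sketch passes floats.

```python
def format_field(value):
    """Render one field value in line protocol syntax."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; escape backslashes and embedded quotes.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def escape_key(text):
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text

def to_line(measurement, tags, fields, timestamp):
    """Assemble <measurement>,<tag set> <field set> <timestamp>."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp}"

line = to_line("cpu", {"host": "serverA", "region": "uswest"},
               {"idle": 23.0, "user": 42.0, "system": 12.0}, 1549063516)
```

Sorting the tags by key is a choice made here so that the same series always produces the same string; it is not required for the line to be valid.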
Timestamps in InfluxDB can have second, millisecond, microsecond, or nanosecond precision. The microsecond and nanosecond scales make InfluxDB a good choice for use cases in finance and scientific computing where other solutions would be excluded. Compression is variable depending on the level of precision the user needs.
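The inverted index over measurement names and tag sets mentioned earlier in this answer can be pictured with a small Python sketch. It is a toy illustration of the general technique, not InfluxDB's index implementation; the `SeriesIndex` class and the `"_measurement"` key are hypothetical names chosen for the example.

```python
from collections import defaultdict

class SeriesIndex:
    """Toy inverted index from measurement and tag pairs to series keys."""

    def __init__(self):
        # Maps a (key, value) term to the set of series keys that contain it.
        self.postings = defaultdict(set)

    def add(self, measurement, tags):
        key = measurement + "".join(f",{k}={v}" for k, v in sorted(tags.items()))
        self.postings[("_measurement", measurement)].add(key)
        for pair in tags.items():
            self.postings[pair].add(key)
        return key

    def lookup(self, measurement, **tags):
        terms = [("_measurement", measurement), *tags.items()]
        # Intersect the posting sets; a term that was never indexed yields no matches.
        result = None
        for term in terms:
            matches = self.postings.get(term, set())
            result = matches if result is None else result & matches
        return sorted(result)

index = SeriesIndex()
index.add("cpu", {"host": "serverA", "region": "uswest"})
index.add("cpu", {"host": "serverB", "region": "uswest"})
index.add("mem", {"host": "serverA", "region": "uswest"})
west_cpu = index.lookup("cpu", region="uswest")
one_host = index.lookup("cpu", host="serverA")
```

Finding the matching series is a few set intersections rather than a scan over every stored point, which is what makes lookups for specific series fast.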
Q5. The fact that time series data is ordered makes it unique in the data space because it often displays serial dependence. What does it mean in practice?
Ryan Betts: Serial dependence occurs when the value of a datapoint at one time is statistically dependent on another datapoint at another time.
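One common way to measure serial dependence is the sample autocorrelation at a given lag. The Python sketch below is a minimal illustration under that assumption; the function name and the two sample series are hypothetical, and a constant series would divide by zero.

```python
from statistics import mean

def autocorrelation(values, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    numer = sum((values[i] - mu) * (values[i + lag] - mu)
                for i in range(len(values) - lag))
    return numer / denom

# A steadily rising series: each value is close to the previous one.
trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# A series that flips sign at every step.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r_trend = autocorrelation(trend)        # positive: neighbors move together
r_alt = autocorrelation(alternating)    # negative: neighbors move in opposite directions
```

A value near zero would suggest that knowing one data point says little about the next, which is the case where the time axis adds the least.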
Though there are no events that exist outside of time, there are events where time isn’t relevant. Time series data isn’t simply about things that happen in chronological order — it’s about events whose value increases when you add time as an axis. Time series data sometimes exists at high levels of granularity, as frequently as microseconds or even nanoseconds. With time series data, change over time is everything.
Q6. How is time series data understood and used?
Ryan Betts: Time series data is gathered, stored, visualized and analyzed for various purposes across various domains:
- In data mining, pattern recognition and machine learning, time series analysis is used for clustering, classification, query by content, anomaly detection and forecasting.
- In signal processing, control engineering and communication engineering, time series data is used for signal detection and estimation.
- In statistics, econometrics, quantitative finance, seismology, meteorology, and geophysics, time series analysis is used for forecasting.
Time series data can be visualized in different types of charts to facilitate insight extraction, trend analysis, and anomaly detection. Time series data is used in time series analysis (historical or real-time) and time series forecasting to detect and predict patterns — essentially looking at change over time.
Q7. You also handle two other kinds of data, namely cross-section and panel data. What are these? How do you handle them?
Ryan Betts: Cross-sectional data is a collection of observations (behavior) for multiple entities at a single point in time. For example: Max Temperature, Humidity and Wind (all three behaviors) in New York City, San Francisco, Boston, Chicago (multiple entities) on 1/1/2015 (single instance).
Panel data is usually called cross-sectional time series data, as it is a combination of both time series data and cross-sectional data (i.e., collection of observations for multiple subjects at multiple instances).
This collection of data can be combined in a single series, or you can use the Flux language to combine and review this data to gather insights.
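The relationship between the three kinds of data can be shown with a short Python sketch: a panel is a set of observations for several entities at several times, a cross-section is one slice by time, and a time series is one slice by entity. The temperature values and function names below are hypothetical and exist only for illustration.

```python
# Hypothetical panel: (city, date, max_temp_f) observations.
panel = [
    ("New York City", "2015-01-01", 39), ("Boston", "2015-01-01", 33),
    ("Chicago", "2015-01-01", 32),
    ("New York City", "2015-01-02", 42), ("Boston", "2015-01-02", 40),
    ("Chicago", "2015-01-02", 34),
]

def cross_section(rows, date):
    """All entities at one point in time."""
    return {city: value for city, d, value in rows if d == date}

def time_series(rows, city):
    """One entity across all points in time, in time order."""
    return sorted((d, value) for c, d, value in rows if c == city)

jan_first = cross_section(panel, "2015-01-01")
boston = time_series(panel, "Boston")
```

In line protocol terms, the city would naturally be a tag and the temperature a field, so one measurement can hold the whole panel.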
Q8. There are several time series databases available in the market. What makes InfluxDB time series database unique?
Ryan Betts: When doing a comparison, the entire InfluxDB Platform should be taken into account. There are multiple types of databases that get brought up for comparison. Mostly, these are distributed databases like Cassandra or more time-series-focused databases like Graphite or RRD. When comparing InfluxDB with Cassandra or HBase, there are some stark differences. First, those databases require a significant investment in developer time and code to recreate the functionality provided out of the box by InfluxDB. Second, developers have to create an API to write to and query their new service.
Developers using Cassandra or HBase need to write tools for data collection, introduce a real-time processing system and write code for monitoring and alerting. Finally, they’ll need to write a visualization engine to display the time series data to the user. While some of these tasks are handled with other time series databases, there are a few key differences between the other solutions and InfluxDB. First, other time series solutions like Graphite or OpenTSDB are designed with only regular time series data in mind and don’t have the ability to store raw high-precision data and downsample it on the fly.
With other time series databases, the developer must summarize their data before putting it into the database. InfluxDB lets the developer move seamlessly from raw time series data to summarizations.
InfluxDB also has key advantages for developers over Amazon Timestream. Among them:
- InfluxData is first and foremost an open source company. It is committed to sharing ideas and information openly, collaborating on solutions and providing full transparency to drive innovation.
- Hybrid cloud and on-premises support. Every business has specific requirements, and a hybrid cloud system offers the flexibility to choose services that best fit its needs, whether to support GDPR regulatory requirements or teams that are spread across multiple providers.
Q9. What distinguishes the time series workload?
Ryan Betts: Time series databases have key architectural design properties that make them very different from other databases. These include time-stamped data storage and compression, data lifecycle management, data summarization, ability to handle large time-series-dependent scans of many records, and time-series-aware queries.
For example, with a time series database, it is common to request a summary of data over a large time period. This requires going over a range of data points to perform some computation like a percentile increase this month of a metric over the same period in the last six months, summarized by month. This kind of workload is very difficult to optimize for with a distributed key value store. TSDBs are optimized for exactly this use case, giving millisecond-level query times over months of data.
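The shape of such a summary query can be sketched in Python. This is an in-memory illustration only, not how a TSDB executes it. It reads the "increase" above as a percent change of the latest month against each earlier month, which is an assumption, and the function names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

def monthly_means(points):
    """Average (unix_seconds, value) points per calendar month (UTC)."""
    buckets = defaultdict(list)
    for ts, value in points:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        buckets[month].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}

def percent_change_vs_latest(summary):
    """Percent change of the latest month relative to each earlier month."""
    months = list(summary)
    latest = summary[months[-1]]
    return {m: (latest - summary[m]) / summary[m] * 100 for m in months[:-1]}

def ts(year, month, day):
    """Unix seconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Hypothetical metric readings spread over three months.
points = [(ts(2019, 1, 5), 100.0), (ts(2019, 1, 20), 120.0),
          (ts(2019, 2, 3), 80.0),
          (ts(2019, 3, 9), 110.0), (ts(2019, 3, 28), 132.0)]

summary = monthly_means(points)
change = percent_change_vs_latest(summary)
```

Every raw point in the range has to be read to produce a handful of output rows, which is the access pattern a key value store handles poorly and a time series database is built around.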
Q10. Let’s talk about integrations. Software services don’t work alone. Suppose an application relies on Amazon Web Services, or monitors Kubernetes with Grafana or deploys applications through Docker, how easy is it to integrate them with InfluxDB?
Ryan Betts: InfluxData provides tools and services that help you integrate your favorite systems across the spectrum of IT offerings, from applications to services, databases to containers. We currently offer 200+ Telegraf plugins to allow these seamless integrations. Developers using the InfluxDB platform build their applications with less effort, less code, and less configuration with the use of a set of powerful APIs and tools. InfluxDB client libraries are language-specific tools that integrate with the InfluxDB API and can be used to write data into InfluxDB as well as query the stored data.
Ryan Betts is VP of Engineering at InfluxData. Ryan has been building high performance infrastructure software for over twenty years. Prior to InfluxData, Ryan was the second employee and CTO at VoltDB. Before VoltDB, he spent time building SOA security and core networking products. Ryan holds a B.S. in Mathematics from Worcester Polytechnic Institute and an MBA from Babson College.