Principles of Tiered Storage
Tiered Storage is gaining renewed importance as companies realize the costs associated with IoT. Although the concept dates to the 90s, modern storage and database solutions now allow a broader range of topologies to be considered when designing solutions atop operational data stores and data lakes.
To review: IoT data is effectively Fast Data, which is a subset of Big Data. IoT inherently needs to support the Volume, Velocity, and Variety of data coming from sensors, and to process that data quickly and efficiently. This processing and storage comes at a cost, especially for real-time analytics. Thus, it is not surprising that companies find themselves struggling with the cost of storage while simultaneously handling ever larger volumes of data.
A key tenet of the Tiered Storage data architecture pattern is that data is stored across storage mediums of varying cost and performance. These include disks, memory, and Cloud object storage. I differentiate Tiered Storage from “archiving”: unlike data archiving practices, Tiered Storage keeps the data available to applications in situ, without the need to restore it. We can capitalize on this by migrating older data to slower, less expensive mediums, with the expectation that it will not be queried as often or be as time sensitive. Slower performance is an acceptable tradeoff for less expensive storage, and it is up to you to determine how your data should be distributed across mediums and what level of performance is acceptable for each.
There are several ways to implement Tiered Storage, including hybrids of Cloud and on-premises storage as well as choices at the server level such as RAM, SSD, and HDD. In a simplistic example, we can store data on a virtual machine and place newer “hot” data on SSDs and older “cold” data on HDDs. Moreover, as data becomes even less valuable it can be migrated to cheaper Cloud storage such as Azure Blob storage. The key principle is that we want to keep our most valuable “hot” data on the most performant medium and progressively migrate less valuable data downstream to ever less expensive ones. Therefore, your “shard” or partitioning keys are of critical importance and must align with your data access needs. For example, you may distribute data by a timestamp and keep only the last month of data in RAM, while data older than a year may be stored in Azure Blob storage. Everything in between would be stored on progressively cheaper mediums per your distribution key(s). Moreover, advances in the NoSQL, NewSQL, and Cloud storage domains allow us to be very creative in our architectures.
For example, SQL Server 2016 and later lets you combine relational data with data stored on HDFS in the same query through its PolyBase feature. Furthermore, both SQL Server and MongoDB allow you to implement a topology of nodes that store data in memory, on SSD, and finally on HDD.
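To make the timestamp-based distribution described earlier more concrete, here is a minimal Python sketch of how a record's age could be mapped to a storage tier. Only the one-month (RAM) and one-year (Azure Blob storage) boundaries come from the example in the text; the 90-day SSD boundary, the tier names, and the function name `tier_for` are assumptions made purely for illustration, not part of any product's API.

```python
from datetime import datetime, timedelta

# (maximum age, tier) pairs, ordered from hottest to coldest.
# The 30-day and 365-day boundaries follow the example in the text;
# the 90-day SSD boundary is an assumed intermediate threshold.
TIERS = [
    (timedelta(days=30), "ram"),
    (timedelta(days=90), "ssd"),
    (timedelta(days=365), "hdd"),
]
COLDEST_TIER = "blob"  # stands in for Cloud object storage


def tier_for(record_time: datetime, now: datetime) -> str:
    """Return the hypothetical tier a record belongs to, based on its age."""
    age = now - record_time
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    # Anything older than the last threshold goes to the cheapest medium.
    return COLDEST_TIER


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    for days_old in (1, 60, 200, 400):
        print(days_old, tier_for(now - timedelta(days=days_old), now))
```

In practice the thresholds would follow from your own workload and data access requirements rather than fixed constants like these.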
In conclusion, consider the Tiered Storage architecture pattern as you pursue storage solutions for your IoT implementations. Understand your workload and data access requirements to better identify the types of storage you need. I strongly recommend you then put a distributed database technology at the forefront of your solution. Distributed database engines can perform targeted queries across data shards that sit on different storage mediums, guided by the user-defined distribution key. This keeps the query as efficient as possible, because only the shards that can contain relevant data are queried across the storage tiers.
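The targeted querying described above can be illustrated with a small Python sketch. It assumes shards are distributed by date range and that each shard records which tier it lives on; the `Shard` class, the `shards_for_query` function, and the sample shard layout are all hypothetical and do not reflect how any particular database engine implements this internally.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Shard:
    name: str
    tier: str    # storage medium holding this shard (illustrative labels)
    start: date  # inclusive lower bound of the distribution key
    end: date    # exclusive upper bound of the distribution key


def shards_for_query(shards: List[Shard], query_start: date, query_end: date) -> List[Shard]:
    """Keep only shards whose [start, end) range overlaps [query_start, query_end)."""
    return [s for s in shards if s.start < query_end and query_start < s.end]


# Assumed layout: newer data on faster tiers, older data on cheaper ones.
SHARDS = [
    Shard("2024-06", "ram", date(2024, 6, 1), date(2024, 7, 1)),
    Shard("2024-h1", "ssd", date(2024, 1, 1), date(2024, 6, 1)),
    Shard("2023", "hdd", date(2023, 1, 1), date(2024, 1, 1)),
    Shard("pre-2023", "blob", date(2000, 1, 1), date(2023, 1, 1)),
]

if __name__ == "__main__":
    # A query over recent data touches only the in-memory shard.
    for shard in shards_for_query(SHARDS, date(2024, 6, 10), date(2024, 6, 20)):
        print(shard.name, shard.tier)
```

A query confined to recent dates never touches the slower tiers, while a query spanning a tier boundary reads from each tier it overlaps and nothing more.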