On InfluxDB Edge Data Replication. Q&A with Sam Dillard
Data has gravity, and that gravity gets stronger as applications become more distributed and data volumes at the edge grow exponentially. So the big challenge is creating a way to manage and utilize edge data in an efficient yet centralized way.
Q1. Do you see a rise in highly distributed applications producing enormous data volumes at the edge?
Absolutely. The enormous volume of data produced by distributed applications is making the edge more important than ever. It’s therefore critical for developers and engineers to design a data topology that lets them unlock insights from massive volumes of edge data. Every high-tech industry – from manufacturing to aerospace to energy – is undergoing a transformation to create infrastructures that can better manage edge data.
Q2. What are the typical challenges for such applications?
Data has gravity, and that gravity gets stronger as applications become more distributed and data volumes at the edge grow exponentially. So the big challenge is creating a way to manage and utilize edge data in an efficient yet centralized way. First, we should look at the properties of edge data vs. cloud data:
- Edge data: Keeping data at the edge makes it cheaper to work with, reduces latency, and removes the need for an internet connection to make decisions. It’s also important to note that edge assets rarely deal with data pertaining to any asset but themselves. To analyze many assets, operators need to zoom out and consolidate contexts for a centralized view.
- Cloud data: Cloud data is a completely different kind of data to manage. The cloud is central, meaning it can see all and provide context surrounding devices that otherwise wouldn’t have it, and it can store more data for the more brainy business-level insights.
The relationship between the edge and the cloud hinges on their interdependence. Edge cannot see the forest for the trees and the cloud can only see what it is given. By leveraging their respective advantages, operators can be more efficient by consuming and analyzing only the data and insights that each needs.
Q3. You just announced InfluxDB Edge Data Replication. What is it?
Edge Data Replication (EDR) is a new feature from InfluxData that we announced last month. It enables developers to process data at the edge and automatically replicate it to the cloud in a durable, reliable way. EDR ensures the edge and cloud data layers work together in a way that delivers centralized business insights across widely distributed environments in near real-time.
Developers that leverage EDR in tandem with InfluxDB and Flux, InfluxDB’s native data scripting language, can realize the following benefits: reduce cloud ingress and egress costs; transform data before transfer; enforce edge/cloud consistency; and realize full data workflow at the edge.
Q4. How does Edge Data Replication enable developers to analyze time series data in InfluxDB at the edge?
EDR works by allowing users to safely replicate data from InfluxDB OSS buckets to InfluxDB Cloud buckets in real-time, meaning they can now collect, store and analyze high-precision time series data from the edge, and view that data in the cloud.
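To make the idea of safe replication concrete, here is a minimal store-and-forward sketch in Python. It is not InfluxDB's implementation: the `ReplicationQueue` class, the `send` callback, and the batch size are all hypothetical names chosen for illustration, and the buffer lives in memory, whereas a durable queue would persist to disk. It only shows the general pattern of accepting writes locally and holding them until the remote side acknowledges them.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List

Point = str  # one record of time series data; the format is irrelevant here


class ReplicationQueue:
    """Hypothetical buffer sitting between an edge bucket and a remote bucket."""

    def __init__(self, send: Callable[[List[Point]], bool], batch_size: int = 2) -> None:
        self._send = send              # assumed to return True once the remote acknowledges
        self._batch_size = batch_size
        self._pending: Deque[Point] = deque()

    def write(self, point: Point) -> None:
        # Local writes always succeed; the point waits here until it is replicated.
        self._pending.append(point)

    def flush(self) -> int:
        """Send pending points in order, stopping at the first failed batch.

        Returns the number of points delivered during this call.
        """
        delivered = 0
        while self._pending:
            batch = list(islice(self._pending, self._batch_size))
            if not self._send(batch):
                break  # leave the batch queued so the next flush retries it
            for _ in batch:
                self._pending.popleft()
            delivered += len(batch)
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

If the network is down, `flush()` delivers nothing and the points stay queued. Once the link returns, the same points go out in their original order, so an outage delays data rather than losing it.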
This new feature builds upon two key properties of InfluxDB OSS. First, InfluxDB OSS can run very efficiently on resource-constrained systems. Second, InfluxDB OSS runs Flux, a functional data scripting language and task engine that can analyze and transform time series data in any way developers choose. With EDR, we’ve developed a way to connect these OSS properties to enable a full-featured edge-to-cloud data pipeline and unlock new data processing that combines the precision of the edge and the power of the cloud.
Given the constraints in data transfer, an important property of edge-to-cloud data movement is data reduction. Veterans of this know that data reduction often means losing important detail in your data. However, the added benefit of the Flux engine is that it supports both built-in and custom functions that can do this in just about any way users want. This means they can employ sophisticated algorithms to do aggregations that retain data shape and important events in the original data.
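As a rough illustration of shape-preserving reduction, the sketch below keeps only the minimum and maximum sample of each time window, so a short spike survives where a plain average would flatten it. It is written in Python rather than Flux, and the function names, the `(timestamp, value)` tuple layout, and the window size are assumptions made for this example only.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def _extremes(bucket: List[Sample]) -> List[Sample]:
    lowest = min(bucket, key=lambda s: s[1])
    highest = max(bucket, key=lambda s: s[1])
    # Emit in time order; a flat window collapses to a single point.
    return sorted({lowest, highest})


def downsample_min_max(samples: List[Sample], window: int) -> List[Sample]:
    """Keep the min and max sample of each fixed-width window.

    Assumes `samples` is already sorted by timestamp.
    """
    reduced: List[Sample] = []
    bucket: List[Sample] = []
    current_start: Optional[int] = None
    for ts, value in samples:
        start = ts - ts % window
        if current_start is not None and start != current_start:
            reduced.extend(_extremes(bucket))
            bucket = []
        current_start = start
        bucket.append((ts, value))
    if bucket:
        reduced.extend(_extremes(bucket))
    return reduced
```

With a window of four seconds, eight input samples shrink to at most four, and a reading of 9.0 among values near 1.0 is still present in the output. Averaging the same window would report roughly 3.0 and hide the event.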
Ultimately, operators of this architecture can assure their users that the data at the edge will make it safely to the cloud and, if necessary, be faithful to its original form.
Q5. Why is replicating all or subsets of this data into InfluxDB Cloud critical for additional processing and visibility?
Edge data is useful in both places. At the edge, it is used in its raw and most granular form. The data at the edge is most faithful to what actually happened at the edge, and can be used to investigate and remediate issues at the device/system level. That said, cloud computing is an enormous industry for a reason too, and cloud computing needs cloud data.
Having data that was born at the edge also stored in the cloud accomplishes a lot. First, it makes the data available centrally for any application/client that may need it, which adds enormous value alone. Additionally, the scale of the cloud allows for an entirely different type of number crunching involving much larger computations like training ML models and forecasting big picture metrics that involve data from everywhere.
Q6. Tell us about some use cases for Edge Data Replication. What are some common scenarios developers will use this feature?
Use cases for EDR span any industry that involves building or managing distributed applications – applications that connect people, machines, and software. Manufacturing, aerospace, energy and other high-tech industries are full of these. Every industry has its edge, and the biggest opportunity from the edge across industries comes from organizations with technology deployed right at the line where cyber meets physical. Here are a few examples:
- In manufacturing: Data is created at the edge by machines on the factory floor and the local applications that power them. Local operators look at machine data related to input control changes, mechanical issues, and operations procedures. On the flip side, the team at corporate headquarters also needs this information, but moving these massive volumes of data to a cloud warehouse isn’t practical since data is expensive to move in bulk and it’s difficult to ensure all data moves from edge to cloud without any loss or corruption. With EDR combined with Flux data transformations, users can now access mission-critical data where they need it, combining the power of the cloud with the precision of the edge.
- In financial services: In a world where reducing network latency means big money, trading algorithms are deployed on devices installed as close to the market’s servers as physically possible. This strategy allows decisions to be made in microseconds, and this process creates vast amounts of infrastructure, algorithm performance, and trade-related data. Shipping all of this data to a data warehouse is impractical, if not impossible, so performing real-time replication of aggregated data to a central repository is really the best way to understand and improve these critical platforms, and to retrain AI models and redeploy them for more profitable trades.
- In energy: Take wind turbines, which require significant upkeep to avoid expensive malfunctions. The time series data emitted from the turbines can help prevent future malfunctions by notifying operators when they need to physically visit the farm for maintenance. The turbine health data is no good sitting with the machine when operators are remote, but it is also useful at the machine when operators are on site. EDR enables predictive maintenance (centralized) while also retaining detailed data for onsite support (localized).
Each of these use cases involves a time series architecture that leverages both edge and cloud environments (i.e., processing data at the edge and in the cloud simultaneously). The edge environment keeps very granular data for local detailed analysis and powers things like “local control loops” but also sends data to the cloud so analysts have a more accurate picture of what happened at the distributed locations.
Q7. In the product announcement, you talk about “transform data before transfer”. What does it mean?
This is a critical property of the feature because it solves the biggest set of problems facing edge-cloud pipelines today: data egress at the edge. Replications make the data transfer fast and safe, but the reason InfluxDB was the best platform to introduce this feature was the native data transformation API it offers.
Putting automated control over the size and shape of data at the edge gives businesses the necessary tools to work within the constraints of internet bandwidth, cloud storage costs, and cloud query performance objectives. The almost Turing-complete nature of the computation engine (Flux) allows users to not only shrink their outbound data but also retain its usefulness. Edge-cloud data pipelines, given these constraints, do not work without this.
Q8. You also talked about reducing “cloud ingress and egress costs.” How?
Simply put, this is reduced by making the ingressed data smaller. Replications themselves do not do this, but they are the backbone of the data transfer. The Flux scripting engine is what provides the data reduction. Reducing the size of the transferred data means less bandwidth, cloud storage, and often cloud query costs.
There is the added benefit of Flux being bundled with the database, which means we are bringing the computation to the data and not requiring users to run, host, and update their own processor clients. In this way, cloud ingress is again made less expensive.
Q9. Isn’t there a risk of duplication between data stored at the edge and in the cloud? How do you enforce edge/cloud consistency?
There will be some duplication by design, as that’s the core of how EDR works. That said, it’s not a risk with which users should be concerned. By design, these will be data points in physically separate locations used by different clients and applications for different purposes.
As for ensuring businesses don’t store too much edge data that they already have in the cloud, InfluxDB natively offers retention rule-based eviction of data. For example, an edge node may keep its to-be-replicated data around for only a few hours while the cloud holds onto it for years. In this sense, very little will be duplicated.
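The effect of different retention periods can be sketched in a few lines of Python. This is an illustration of the concept, not of how InfluxDB enforces retention internally. The `evict_expired` function and the specific periods (four hours at the edge, about three years in the cloud) are hypothetical values chosen for the example.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)

HOUR = 3600
YEAR = 365 * 24 * HOUR


def evict_expired(samples: List[Sample], retention_seconds: int, now: int) -> List[Sample]:
    """Return only the samples still inside the retention period ending at `now`."""
    cutoff = now - retention_seconds
    return [s for s in samples if s[0] >= cutoff]


# The same three points, held under two different (assumed) retention rules.
now = 10 * HOUR
points = [(1 * HOUR, 0.5), (7 * HOUR, 0.7), (9 * HOUR, 0.9)]

edge_copy = evict_expired(points, 4 * HOUR, now)    # edge keeps a few hours
cloud_copy = evict_expired(points, 3 * YEAR, now)   # cloud keeps years
```

Here the edge copy retains only the two most recent points while the cloud copy retains all three, so the overlap between the two locations is limited to the short edge retention period.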
Q10. Anything else you wish to add?
The last point I want to mention is that EDR is the first step in our broader vision to support future technologies, applications and developer challenges with the InfluxDB platform. At InfluxData, we talk a lot about ‘meeting developers where they are’ – by this we mean meeting developers in their language of choice, using their preferred tools, running on their preferred cloud platform, etc. With EDR, we’re meeting developers at the edge so they can quickly uncover critical insights and move on to other aspects of building their applications.
If you’re interested in unlocking edge-cloud duality with EDR, you can get started today by signing up for InfluxDB Cloud if you haven’t already. Otherwise, check out the feature documentation and watch our Meet the Developers video to learn more.
Sam Dillard is a Senior Product Manager, Edge at InfluxData. He is passionate about making customers successful with their solutions as well as continuously updating his technical skills. Sam has a BS in Economics from Santa Clara University.
Sponsored by InfluxData