On Change Data Capture. Q&A with Gary Hagmueller
“CDC is the gold standard for moving data from an operational system to another (think migration) or an analytical system (think replication).”
Q1. What are the main challenges when moving high-volume, high-velocity data between transactional databases, data warehouses, and cloud platforms?
A: Most folks who have not had to face this task assume: “data is data, how hard could it be?” Anyone who has ever actually needed to move data between systems & platforms knows just how vexing data movement really is. There are too many challenges to list them individually, but let me try to consolidate them under three main groups:
- Normalization
Anyone who has worked on consolidating data sources will agree that each system, no matter how similar, varies to some degree that can make it incredibly hard to integrate data. Purpose-built systems can have materially different data formats for the same data set. Integrating data with different formats is the challenge. Let’s discuss a simple example to illustrate this.
In one system, you may be storing a zip code as a 5-digit string. In another, it may be a 9-digit string. In one system, a birthdate has 6 digits; in another, it has 8. You get the point. At a basic level, all these nuances, whether subtle or significant, need to be normalized.
Now take this a step further and think of all the things that need to be normalized to properly integrate data. How do the column names line up? What about schemas: what happens if one end is a document-based system & the other is a SQL- or NoSQL-based system? What happens if one system is distributed & the other is not? The list of things that need to be normalized for data to be meaningful and usable in data applications or data analysis is considerable.
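To make the zip code and birthdate example concrete, here is a minimal Python sketch of field-level normalization. The function names and the specific formats (ZIP+4 written as 12345-6789, 6-digit birthdates as MMDDYY, 8-digit birthdates as YYYYMMDD) are assumptions chosen for illustration; real systems would need their own rules.

```python
from datetime import date


def normalize_zip(raw: str) -> str:
    """Return a 5-digit ZIP as-is and a 9-digit ZIP as 12345-6789."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 5:
        return digits
    if len(digits) == 9:
        return f"{digits[:5]}-{digits[5:]}"
    raise ValueError(f"unrecognized zip code: {raw!r}")


def normalize_birthdate(raw: str, today: date) -> date:
    """Parse MMDDYY (assumed) or YYYYMMDD (assumed) into a date."""
    if not raw.isdigit():
        raise ValueError(f"unrecognized birthdate: {raw!r}")
    if len(raw) == 8:
        return date(int(raw[:4]), int(raw[4:6]), int(raw[6:]))
    if len(raw) == 6:
        month, day, yy = int(raw[:2]), int(raw[2:4]), int(raw[4:])
        # A birthdate cannot be in the future, so choose the most recent
        # century that does not put the date after `today`.
        in_past = (2000 + yy, month, day) <= (today.year, today.month, today.day)
        century = 2000 if in_past else 1900
        return date(century + yy, month, day)
    raise ValueError(f"unrecognized birthdate: {raw!r}")


today = date(2024, 1, 1)
print(normalize_zip("941051234"))            # 94105-1234
print(normalize_birthdate("031785", today))  # 1985-03-17
print(normalize_birthdate("19850317", today))  # 1985-03-17
```

Even this toy version has to make a judgment call (which century a two-digit year belongs to), which is the kind of decision that multiplies across every column being integrated.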
- Scale
Let’s assume you’ve hired top tier database experts and solved the normalization issues we just talked about. Next, we need to remember that modern organizations are producing data at an exponential rate. While such data volumes can act as a rich source of inputs for downstream applications, managing data at this scale is a difficult engineering problem to solve.
For legacy applications, adding scale meant adding new compute, storage or memory and then manually configuring the new hardware to work with the application load. Automation made it easy to deploy applications in a box, but configuring those applications to scale seamlessly inside the box remained a challenge.
Modern systems, by contrast, are designed to automatically scale up and down, based on the needs of the applications. They align with the cloud infrastructure that runs most modern enterprise applications.
Infrastructure vendors of the pre-cloud era also had to solve scale issues, especially in moving data between platforms. Data migration and replication have always required a lot of compute. As little as 5 years ago, compute was extremely expensive and often out of physical reach.
So vendors dealt with these old scalability problems in ways that seemed logical then. Since databases a decade or more ago were mostly focused on reads & writes of text or small object-type data, it made sense to prioritize moving smaller objects over larger ones. There were other techniques too, but the basic approach remained the same. But database technologies and the enterprises using them have gone through a paradigm shift since then. Modern databases now deal with larger objects in real time, which means the legacy approach now results in significant latency and unstable pipelines. Solutions born in the modern era such as Arcion are free of this highly limiting architecture.
- Integrity
OK, so let’s say you’ve invested heavily in database experts and distributed system pros and built software that normalizes everything and can handle the scale issues inherent in your modern data (if you’d like a job, please contact me as your skill level is one in a million). You now have the last and perhaps most daunting problem to solve: data integrity. Integrity of data really matters in the modern data stack. Real time is only real if you deliver transactions in the order in which they occur. Otherwise, you will, by default, produce incorrect results.
But that’s not all. You will have to deal with both committed and uncommitted transactions from databases. Ask the database servers to send only commits & you slow them down. Pass uncommitted entries to the warehouse or data app & they either have to run lots of extra complex queries (that break with the slightest data change) or they will generate garbage. And what about those pesky network or system downtime issues? How do you guarantee that the data has been delivered to the destination system 100% of the time but never more than 100%? This list could go on for a long time, and you can probably think of a number of other integrity challenges.
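The three integrity concerns above (ordering, uncommitted work, and delivery that is never more than 100%) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how Arcion works: the log entry layout `(lsn, txn_id, kind, payload)`, the `Target` class and the `replay` function are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Target:
    rows: dict = field(default_factory=dict)
    last_commit_lsn: int = 0  # highest commit position already applied

    def apply_transaction(self, commit_lsn, changes):
        # Never more than 100%: a commit seen before (for example after a
        # network retry) is skipped instead of being applied twice.
        if commit_lsn <= self.last_commit_lsn:
            return False
        for key, value in changes:
            self.rows[key] = value
        # In a real target the rows and this position would have to be
        # written atomically; a single process hides that problem here.
        self.last_commit_lsn = commit_lsn
        return True


def replay(log, target):
    """Buffer changes per transaction and apply them in commit order."""
    pending = defaultdict(list)
    applied = 0
    for lsn, txn_id, kind, payload in log:
        if kind == "change":
            pending[txn_id].append(payload)  # not yet visible downstream
        elif kind == "commit":
            applied += target.apply_transaction(lsn, pending.pop(txn_id, []))
        elif kind == "rollback":
            pending.pop(txn_id, None)  # uncommitted work never reaches the target
    return applied


log = [
    (1, "t1", "change", ("order:1", "paid")),
    (2, "t2", "change", ("order:2", "draft")),
    (3, "t2", "rollback", None),
    (4, "t1", "change", ("order:1", "shipped")),
    (5, "t1", "commit", None),
]

target = Target()
first = replay(log, target)   # applies transaction t1 only
second = replay(log, target)  # simulated redelivery: nothing is applied again
print(first, second, target.rows)  # 1 0 {'order:1': 'shipped'}
```

The rolled-back transaction never appears in the target, and replaying the same log a second time changes nothing, which is the behavior the questions above are asking for.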
Nevertheless, once you solve these issues, you’re pretty much home free! The only things left are to containerize the software, integrate it with Terraform, provide instrumentation so you know it’s working at all times, do the DevOps work needed to keep it humming & staff the maintenance team to fix the issues that invariably occur!
Q2. You are offering a so-called distributed CDC-based data replication platform. What is it, and what is it useful for?
I mentioned earlier that Arcion is the only real modern version of CDC software. What that means is that we align with highly performant databases such as SingleStore that run in the cloud. And by cloud, we mean both serverless deployment modes & VPCs.
Arcion’s architecture is multi-threaded at all levels. So, instead of having to manually configure resources, all that a user needs to do is spin up the compute & the software will manage how it works based on the loads. Think a big spike is coming? Spin up a few more CPUs & let it do its magic. Seeing a slowdown coming? Just drop CPUs & cut your spend.
We can also run on spot instances without the problems that would create for all older generation CDC vendors. So if you really want a modern solution that can optimize both for performance and cost, there’s really little choice.
Q3. What does “Change Data Capture” really mean in practice?
CDC is the gold standard for moving data from an operational system to another (think migration) or an analytical system (think replication). CDC, at least the Arcion variant, does not query the operational system or install software on the transactional servers. Instead, we read the logs that the production system has already generated. The way in which Arcion reads those logs allows us to identify new, deleted or changed entries and immediately send them up the wire. That’s how we ensure the source and target systems are always in perfect sync in real time.
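As a rough illustration of the log-based idea, the Python sketch below models a source that records every write in an append-only log and a reader that only tails that log. All of the names here (`SourceDB`, `LogReader`, `apply_events`) are hypothetical stand-ins; a real database log and Arcion's reader are far more involved.

```python
class SourceDB:
    """Toy operational database that records every write in an append-only log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # stand-in for the log the database already produces

    def insert(self, key, row):
        self.rows[key] = row
        self.log.append(("insert", key, row))

    def update(self, key, row):
        self.rows[key] = row
        self.log.append(("update", key, row))

    def delete(self, key):
        del self.rows[key]
        self.log.append(("delete", key, None))


class LogReader:
    """Tails the log from a remembered offset; it never queries SourceDB.rows."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        events = self.log[self.offset:]
        self.offset += len(events)
        return events


def apply_events(target, events):
    for op, key, row in events:
        if op == "delete":
            target.pop(key, None)
        else:  # insert and update both become an upsert on the target
            target[key] = row


source = SourceDB()
reader = LogReader(source.log)
target = {}

source.insert(1, {"name": "Ada"})
source.insert(2, {"name": "Bob"})
apply_events(target, reader.poll())

source.update(1, {"name": "Ada L."})
source.delete(2)
apply_events(target, reader.poll())

print(target == source.rows)  # True
```

The point of the design is that the reader adds no query load to the source tables: new, changed and deleted entries are all recovered from what the source has already written.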
Q4. SingleStore delivers SingleStoreDB, a distributed SQL database for data-intensive applications. How does it relate to your data replication platform?
We love SingleStore, as you have built a very highly performant database that can really take advantage of the real-time speed we are able to deliver through Arcion. SingleStore is a great place to perform anything that needs real-time data. We recently worked with a company that has 1M+ active users across 23 countries. They outgrew their MySQL database because they had trouble scaling search for their biggest customers and their complex storefront queries took more time than expected, so they migrated 400M rows from MySQL into SingleStore. After two weeks of preparation, the migration itself was completed in minutes (read more about the story).
SingleStore is a cutting-edge database technology, and we’re seeing that SingleStore can take up to 500k ops/sec in ingestion speed.
Moreover, we’re seeing a host of customers move data in from a number of legacy platforms. No one migrates a working database without a lengthy migration process to ensure everything works as designed and nothing breaks. This means systems need to stay in near perfect sync for an extended period of time to properly build and test the new SingleStore deployment. Aside from the technical bits Arcion provides, the most valuable benefit we deliver is the ability for customers to dramatically reduce the risk of a data modernization project centered on SingleStore while increasing the likelihood that such projects succeed.
Q5. Can you give us some examples of how Arcion enhances SingleStoreDB?
There are two key use cases for which companies should leverage Arcion to enhance their SingleStoreDB experience:
- Database Migration
SingleStore is a high-performance relational database management system (RDBMS) designed for real-time operational analytics. It is ideal for applications that require fast data ingestion, complex queries, and real-time analytics.
At the enterprise level, most companies have numerous different database technologies in use. As a result, they often need to migrate data from one database type to another, such as SingleStore, to power their real-time analytics or SaaS apps with a cloud-native database. This can be a complex and time-consuming process, as it requires significant planning and effort to ensure that data is migrated accurately and completely. Organizations face the risk of extensive downtime, data loss, and disruption to business, along with cost-versus-complexity trade-offs, labor-intensive efforts, and a complex bi-directional setup as a fallback mechanism.
This is where Arcion comes in to alleviate the cost and complexity challenges along with reducing the risk to business continuity.
Arcion’s Change Data Capture uses a log-based approach to track changes to data, making it easy to keep the source and target systems in sync. As a result, businesses can migrate data with zero downtime, knowing that their operations will not be interrupted.
If you have a database with lots of schemas and tables, you can leverage Arcion’s built-in automatic schema conversion and schema evolution to greatly simplify your data migration project. Arcion can reduce migration budgets and timelines by a minimum of 90%.
When you move a set of data from a legacy database like Oracle to a modern database like SingleStore, you need to have a clear path for fallback in the event of an issue.
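One common way to keep two systems in sync during a live migration is to bulk-copy a snapshot and then replay the changes logged since that snapshot. The Python sketch below illustrates that general pattern only; it is an assumption made for illustration, not a description of Arcion's implementation, and every name in it is hypothetical.

```python
def apply_change(table, change):
    op, key, row = change
    if op == "delete":
        table.pop(key, None)
    else:  # "upsert"
        table[key] = row


# Source system: current rows plus an append-only change log.
source, log = {}, []


def write(op, key, row=None):
    change = (op, key, row)
    apply_change(source, change)
    log.append(change)


write("upsert", 1, "alice")
write("upsert", 2, "bob")

# Step 1: bulk-copy a snapshot and remember the log position it reflects.
# (Upserts and deletes are idempotent here, so replaying from a slightly
# earlier position would also be safe.)
snapshot_pos = len(log)
target = dict(source)

# The source keeps serving writes while the copy is loaded and tested.
write("upsert", 3, "carol")
write("delete", 1)

# Step 2: replay everything logged after the snapshot position.
for change in log[snapshot_pos:]:
    apply_change(target, change)

# Step 3: cut over only once both sides match.
in_sync = target == source
print(in_sync, target)  # True {2: 'bob', 3: 'carol'}
```

Running the same replay in the opposite direction, from the new system back to the old one, is one way to picture the fallback path mentioned above.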
- Real-time analytics
Today’s modern businesses leverage the latest technologies within their stack, building applications with cutting-edge tools and platforms and delivering great customer experiences that are responsive and intuitive. The businesses that take it even further are those that are coupling great customer experiences and growing adoption with Big Data and AI/ML platforms, like SingleStore. The use of SingleStore unlocks a massive amount of business intelligence and real-time analytics that help businesses and their customers to thrive. These modern organizations need to move their data around for business reasons or consolidate it for federated, actionable insights.
They need tools like Arcion to replicate hundreds of gigabytes of data daily from their transactional databases like Oracle, IBM Db2, and SAP into an analytic platform like SingleStore, to power the important BI dashboards or ML models and get ahead of the competition.
Q6. What kinds of applications are best suited for this combined offering, and which are not?
Any application that needs to stay in sync with the data in real-time is a good fit. If you care about migrating an active system, or multiple active systems without downtime, you need Arcion + SingleStore. If you intend to use SingleStore to run real time advanced analytics or data applications, you’ve got very few viable choices other than Arcion.
If you only care about updates every few days or weeks, batch replication is for you. If you are moving an inactive system or a non-production system, you may not need Arcion. If offline business intelligence (say daily update) is your goal, then you might not need Arcion.
Q7. Anything else you wish to add?
Arcion and SingleStore together can bring quantifiable benefits to any organization looking to modernize its data solutions. Organizations that treat data as a first-class citizen will love the speed, flexibility, and scale powered by SingleStore + Arcion.
…………………………………..
Gary Hagmueller
Author bio: Gary Hagmueller, CEO of Arcion Labs, has been a leader in the tech industry for more than 20 years. With a deep focus on data infrastructure, AI, machine learning, and enterprise software, he has raised over $1.3 billion in debt and equity and played a key role in creating over $10 billion in enterprise value through two IPOs and four M&A exits. Previous to Arcion, he was CEO of CLARA Analytics, COO of Ayasdi, CFO of Zuora, and held many business and corporate development leadership roles. For more information on Arcion, visit www.arcion.io/, and follow the company on LinkedIn, YouTube and @ArcionLabs.
Sponsored by SingleStore