Sub-Millisecond Decisions: Marina Popova on Building a Real-Time Fraud Detection Architecture.

Q1. The core architectural decision in this system is to evaluate all fraud rules atomically within a single transaction using VoltDB’s stored procedures rather than checking rules sequentially across separate queries or services. Can you explain in practical terms what the real-world consequences are of getting that wrong? What does a fraudulent transaction that slips through a non-atomic, eventually consistent system actually look like and how often does that happen in production environments that haven’t addressed this gap?

Marina: When you have multiple rules to check to determine whether a transaction is potentially fraudulent, it is very important to evaluate them atomically, BEFORE you commit the transaction. If you run the checks sequentially, conditions may have changed by the time you evaluate the last rule, so that one of the rules you already evaluated would now produce a different result. You could then commit a transaction that would have been flagged as fraudulent at commit time: for example, an account transfer that goes through and results in the loss of a large amount of funds.

As for how often this happens in production environments, I do not have numerical data to provide, as we are not in the finance business, but I assume the goal of any production environment is to have virtually zero such slips.
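
The race described in this answer can be illustrated with a small, deterministic sketch. It is not VoltDB code: a Python lock stands in for the single-transaction guarantee the interview attributes to stored procedures, and the `Account` type, the two rules, and the `MAX_TX_PER_WINDOW` limit are all hypothetical names chosen for illustration.

```python
import threading
from dataclasses import dataclass, field

MAX_TX_PER_WINDOW = 3  # hypothetical velocity limit, for illustration only


@dataclass
class Account:
    balance: int
    recent_tx_count: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


def rule_sufficient_funds(acct, amount):
    return acct.balance >= amount


def rule_velocity(acct):
    return acct.recent_tx_count < MAX_TX_PER_WINDOW


def commit(acct, amount):
    acct.balance -= amount
    acct.recent_tx_count += 1


def transfer_sequential(acct, amount, interleaved=None):
    """Non-atomic: rules are checked one at a time with no isolation.

    `interleaved` simulates another writer committing between the checks.
    """
    ok_funds = rule_sufficient_funds(acct, amount)
    if interleaved is not None:
        interleaved()  # state changes after the first rule was evaluated
    ok_velocity = rule_velocity(acct)
    if ok_funds and ok_velocity:
        commit(acct, amount)  # commits on a stale funds check
        return True
    return False


def transfer_atomic(acct, amount):
    """Atomic: all rules and the commit see one consistent state."""
    with acct.lock:
        if rule_sufficient_funds(acct, amount) and rule_velocity(acct):
            commit(acct, amount)
            return True
        return False


if __name__ == "__main__":
    a = Account(balance=100)
    transfer_sequential(a, 80, interleaved=lambda: transfer_atomic(a, 80))
    print("sequential balance:", a.balance)

    b = Account(balance=100)
    transfer_atomic(b, 80)
    transfer_atomic(b, 80)
    print("atomic balance:", b.balance)
```

In the sequential path, the funds rule passes against a balance that a second transfer then drains, so the account ends up overdrawn. In the atomic path, the second transfer is rejected because both rules and the commit run against the same state.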

 Q2. You describe VoltDB’s TIME_WINDOW materialized views as providing pre-computed, automatically maintained aggregations with O(1) query cost inside the transaction. For database professionals unfamiliar with that approach, what is the difference in practice between querying a TIME_WINDOW view and running a standard aggregation query — and why does that difference matter so much when you are targeting sub-millisecond decision latency at high transaction volumes?

Marina: The key difference is when the work actually happens. A standard GROUP BY runs at query time – the database has to scan rows and compute the result on demand. At high transaction volumes, inside a transaction that is also evaluating multiple fraud rules, that is simply too slow.

VoltDB materialized views shift the cost to write time. When a transaction is written, the view is updated atomically as part of that same write – so when you query it, you are reading a pre-computed value that is already consistent with the latest state. The read cost is O(1). TIME_WINDOW views apply the same approach to time-based aggregations, such as “how many transactions from this account in the last 60 seconds.” That is exactly the kind of running aggregate you need for velocity-based fraud rules, and it is available at sub-millisecond cost inside the transaction where you are making the allow/deny decision. 

For more details, see how materialized views work in VoltDB.
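
The idea of moving aggregation work from query time to write time can be sketched in a few lines. This is a minimal illustration, not how VoltDB implements its views: the `WindowedTxView` class and its method names are hypothetical, amounts are integer cents, and timestamps are assumed to arrive in non-decreasing order per account. The `naive_count` function shows the scan-at-query-time alternative for comparison.

```python
from collections import defaultdict, deque


class WindowedTxView:
    """Per-account count and total over the last `window_s` seconds.

    The aggregate is updated when a transaction is recorded, so a read
    returns a stored value instead of scanning rows.
    """

    def __init__(self, window_s=60):
        self.window_s = window_s
        self._events = defaultdict(deque)            # account -> (ts, cents)
        self._agg = defaultdict(lambda: [0, 0])      # account -> [count, total]

    def _evict(self, account, now):
        # Drop events that fell out of the window and adjust the aggregate.
        q = self._events[account]
        agg = self._agg[account]
        while q and q[0][0] <= now - self.window_s:
            _, cents = q.popleft()
            agg[0] -= 1
            agg[1] -= cents

    def record(self, account, ts, cents):
        """Write path: maintain the aggregate as part of the insert."""
        self._evict(account, ts)
        self._events[account].append((ts, cents))
        agg = self._agg[account]
        agg[0] += 1
        agg[1] += cents

    def read(self, account, now):
        """Read path: amortized O(1), since each event is evicted once."""
        self._evict(account, now)
        count, total = self._agg[account]
        return count, total


def naive_count(rows, account, now, window_s=60):
    """Query-time alternative: scan every row on each read."""
    return sum(1 for a, ts, _ in rows
               if a == account and now - window_s < ts <= now)


if __name__ == "__main__":
    rows = [("A", 0, 1000), ("A", 30, 500), ("B", 40, 100), ("A", 70, 250)]
    view = WindowedTxView(window_s=60)
    for account, ts, cents in rows:
        view.record(account, ts, cents)
    print(view.read("A", 70), naive_count(rows, "A", 70))
```

A velocity rule such as "more than N transactions in the last 60 seconds" then becomes a single lookup against the maintained aggregate, which is the property the answer relies on.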

 Q3. The geographic enrichment layer using MaxMind GeoIP2 and BigQuery’s ST_CLUSTERDBSCAN is a distinctive element of this architecture. Most fraud detection systems operate at the transaction or account level — you’ve added a spatial clustering dimension that groups threats by geographic proximity rather than by individual IP or country. What specific threat patterns become visible with that approach that would be invisible or ambiguous without it?

Marina: Most fraud detection systems look at transactions individually – is this specific transaction suspicious? Geographic clustering adds another layer: are multiple transactions across different accounts coming from the same physical area at the same time?

That distinction matters when you are dealing with coordinated attacks. A single transaction from an unusual IP is just noise. But fifty transactions hitting different accounts from the same geographic cluster within a short time window is a pattern – the kind you see with distributed account takeover, credential stuffing, or organized card fraud. Those attacks are designed to look normal at the per-account level. You only catch them when you look across accounts spatially.

More broadly, the ability to do real-time geo-based operations is useful beyond just fraud – regulatory compliance, regional service availability, detecting unusual access patterns. The point of this architecture is that you can do these kinds of spatial evaluations without sacrificing the sub-millisecond latency that the decision layer requires.
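
The cross-account clustering idea in this answer can be sketched with a small, plain-Python DBSCAN over latitude/longitude points. This is only an illustration of the technique: the architecture described here uses BigQuery's ST_CLUSTERDBSCAN, and the function names, the 5 km radius, and the thresholds below are assumptions, not values from the project.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_pts):
    """Return one cluster label per point; -1 marks noise. O(n^2) sketch."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


def suspicious_clusters(events, eps_km=5.0, min_pts=3, min_accounts=3):
    """events: (account_id, lat, lon) tuples from one time window.

    Returns {cluster_label: set_of_accounts} for clusters that span at
    least `min_accounts` distinct accounts.
    """
    labels = dbscan([(lat, lon) for _, lat, lon in events], eps_km, min_pts)
    accounts = {}
    for (account, _, _), label in zip(events, labels):
        if label >= 0:
            accounts.setdefault(label, set()).add(account)
    return {k: v for k, v in accounts.items() if len(v) >= min_accounts}


if __name__ == "__main__":
    events = [
        ("a1", 52.520, 13.405), ("a2", 52.521, 13.406), ("a3", 52.519, 13.404),
        ("a4", 48.857, 2.352), ("a5", 40.713, -74.006),
    ]
    print(suspicious_clusters(events))
```

Each event in the sample looks unremarkable per account, but three different accounts fall into one tight spatial cluster while the two isolated events are treated as noise. That is the distinction between a pattern and noise that the answer draws.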

 Q4. You used Claude Code with the VoltDB Skill to generate the initial working application — including the schema, stored procedures, materialized views, client code, and integration tests. That is a significant portion of the foundational work. How did that change your development process in practice, where did the generated code hold up well under scrutiny, and where did you find yourself needing to intervene, correct, or extend what was produced?

Marina: Using the Volt development skill, I was able to generate a fully functional Java client application and all the VoltDB artifacts (schema, stored procedures, materialized views), and they worked as generated. The generated code was correct, the schema had all the relevant fields, and the stored procedures and materialized views compiled and loaded into VoltDB without errors; all of this was also validated by the generated integration tests. I found almost no need to intervene, other than to clearly state the goals and deliverables I wanted. I did some minor fine-tuning of the schema and database operations, mostly to add information as I got deeper into the implementation.

This was a significant boost and a great starting point. However, it was only a start: there was more work to do to design and integrate the GCP components, including:

  • Architect the GCP-based system to ingest data from PubSub, transform and enrich it with geo information, and land it in the final analytical tables in BigQuery
  • Add publishing of transaction/request data from the generated application into GCP PubSub topics
  • Evaluate options for GeoIP lookup and their integration with the GCP-based system
  • Design analytical schema in BigQuery (final tables with data we want to report on and visualize)
  • Design and implement end-to-end data transformation pipelines – from PubSub to BigQuery – using Dataform and BigQuery PubSub Subscriptions
  • Develop final analytical queries and visualize their results

 Q5. This architecture was built as a proof of concept on GCP, but the article is careful to note what a production deployment would require that the POC intentionally omits — REST API endpoints, authentication, observability, error handling, and more. For a data engineering or platform team considering taking an architecture like this from POC to production, what are the hardest gaps to close, and which production requirements tend to be most underestimated when teams are still in the design and prototyping phase?

Marina: As always, developing working code is probably only 10% of the work to get your project ready for real-life usage 🙂. There are many areas that need to be addressed to productize this solution. Some (but not all!) are:

  • Full-stack observability – as the foundation of “trust” in your system. This is one requirement that is often underestimated and bolted on as an afterthought or in response to issues found, sometimes even in production.
    • In this POC, as a minimal “contract validation”, I added an integration test that generates a controlled sample of test requests, processes it through the full solution, and verifies the expected results of all analytical queries in BigQuery.
  • Monitoring and alerting – for exceptions/errors, resource utilization, and various health metrics 
  • Authentication and authorization – and integration with existing security frameworks, if applicable, when deploying into legacy/existing architecture
  • Access – API endpoints and/or UX-based for interactive usage
  • CI/CD
  • SLA requirements and validation (fault tolerance, data consistency, latency and query performance, etc.)
  • And more

Most of these should be planned for as part of the post-POC phase and then addressed from the very beginning of development and testing.

The good news is that all these aspects are well known, with well-understood solution patterns. Most cloud platforms have excellent integrated tools for most of them. I am especially fond of GCP tracing and Log Explorer, with options to add tracing tokens throughout the end-to-end pipelines. Closing these gaps does require a broader awareness and knowledge of the state of modern data architecture, but there are numerous resources to learn from. My favorites are the ThoughtWorks analyses and reports, ByteByteGo, Medium, industry technical blogs and podcasts, and architecture blueprints.

…………………………………………………….

Marina Popova, Senior Systems Software Engineer, Volt Active Data.

Sponsored by Volt Active Data
