
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. To see the complete ODBMS.org
website with useful articles, downloads and industry information, please click here.

Mar 7 26

Technical Architecture Focus: Scaling Pandas to Petabytes: The Architecture and Tradeoffs of BigQuery DataFrames. Interview with Ivan Santa Maria Filho

by Roberto V. Zicari

Q1. You mentioned that BigFrames represents an interesting case study in “how a large company like Google can use OSS without really using OSS in the codebase.” Can you unpack this paradox?

Specifically:

  • BigFrames provides a pandas API, but the actual execution happens in BigQuery’s SQL engine via transpilation through intermediate representations (Ibis, SQLGlot). What are the fundamental architectural tradeoffs you face when creating an API-compatible layer versus actually forking and extending the original codebase?
  • From a legal/IP perspective, what considerations drive Google’s decision to reimplement APIs rather than wrap or extend existing OSS libraries? Is this purely about licensing, or are there technical benefits to the “clean room implementation” approach?
  • When you inevitably discover that certain pandas operations can’t be efficiently mapped to BigQuery SQL primitives, how do you decide between: (a) dropping that operation from your API surface, (b) implementing workarounds that might surprise users with different performance characteristics, or (c) extending BigQuery itself to support the operation natively?

Ivan Santa Maria Filho: Over the past 6 years I’ve been either leading or owning large data warehouse products. That includes Microsoft Cosmos Analytics and Azure Data Lake Analytics, and more recently leading a group in Google BigQuery called “BeyondSQL”. All three of those products are widely used by data scientists across the industry and represent more than 20 years of innovation. Cosmos Analytics and Azure Data Lake Analytics have their own programming language, and BigQuery is SQL-centered.

Both approaches have their merits and limitations. While a dedicated, proprietary language allowed us to innovate at Microsoft and build an amazing product, I believe that learning a proprietary programming language is not as interesting in 2026 as it was in 2008. People change jobs more often, and quite honestly Python seems to be the winner for data scientists. SQL, while widely used and familiar, does not have the best control flow and error handling semantics. BigQuery in general continues to advance SQL with extensions like BQML, but is also betting on Python and notebooks.

I believe Python won because it is fun to use, and quite honestly easier than a lot of other languages. It is growing in complexity, but I can see how a duck-typed, interpreted language would be more attractive to someone coming from an environment like Matlab, and leveraging a wide, awesome ecosystem of freely available libraries. My take is that the Python community did an exceptional job making it a very rich ecosystem, and got several large companies to contribute. I am looking forward to all performance improvements coming down their development pipeline.

Our strategy for features, just like the product itself, is to respect where our customers are. Data scientists like Python and notebooks, so they get Python and notebooks. Because data frames are a popular data abstraction, they get BigFrames.

We tried to keep exactly the same semantics, for example implicit ordering. By default, “head(5)” has “top(5)” semantics in BigFrames, which is a costly thing to do if the underlying data is a 1PB table without an index. If the user wants performance, though, they can choose to relax the ordering semantics and get results faster and cheaper.

The architecture choice considerations were all technical. Our first implementation relied heavily on Ibis, and we love it, but we are now writing our own compiler layer. We want to make the BigFrames package smaller, and add BigQuery specific features without polluting Ibis with vendor specific details. We will continue to contribute to Ibis and in many cases they remain the right choice for developers.

BigFrames does not use any proprietary APIs; anyone could write something like it. But we work where we work, and we made specific choices that only make sense for BigQuery. For instance, we use the BigQuery store read/write streaming operations instead of running a “select *” query. We also implemented a client-side smart cache that supports several predicate push-down techniques that are not general at all. We would love to see people extending BigFrames to other storage systems and data warehouses, but right now we are focused on BigQuery.
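To make the predicate push-down idea concrete, here is a toy sketch of a client-side cache that answers a stricter predicate from an earlier, broader result set. The class, the single-lower-bound predicate model, and `fake_server_fetch` are all illustrative assumptions, not BigFrames internals.

```python
class PredicateCache:
    """Toy cache for rows matching a numeric lower-bound predicate.

    If a new request's bound is at least as strict as a cached one,
    the cached rows are a superset of the answer, so we can filter
    locally instead of making another server round-trip.
    """

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn   # server-side fetch: bound -> rows
        self._cached_bound = None
        self._cached_rows = None

    def rows_greater_than(self, bound):
        # Cache hit: the cached predicate subsumes the requested one.
        if self._cached_bound is not None and bound >= self._cached_bound:
            return [r for r in self._cached_rows if r > bound]
        # Cache miss: go to the "server" and remember the result.
        self._cached_rows = self._fetch_fn(bound)
        self._cached_bound = bound
        return list(self._cached_rows)


server_calls = []

def fake_server_fetch(bound):
    """Stand-in for a BigQuery read; records each round-trip."""
    server_calls.append(bound)
    return [v for v in [5, 15, 25, 35] if v > bound]


cache = PredicateCache(fake_server_fetch)
print(cache.rows_greater_than(10))  # [15, 25, 35] -- one server call
print(cache.rows_greater_than(20))  # [25, 35] -- answered from cache
print(len(server_calls))            # 1
```

The real cache would have to handle far richer predicates; the point is only that subsumption lets the client skip server calls entirely.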

My team also developed support for managed Python functions in BigQuery. Those allow users to package almost anything from the Python ecosystem into a lambda / Cloud Run style function that can be “applied” to a data frame or series. For instance, the user can write a sophisticated image transformation function in sklearn, deploy it as a user defined function, and “.apply()” that function to a multimodal column in BigQuery. They can call Hugging Face from the user function too, or even host a lightweight model in Cloud Run. We take care of deployment, garbage collection, billing, and more, and they get to use anything from the OSS ecosystem when they wish.
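The deploy-then-apply workflow can be pictured with a plain-Python stand-in. The `@udf` decorator, registry, and `apply_udf` helper below are hypothetical simplifications of the managed-function flow, not the real BigFrames or BigQuery API.

```python
# Toy registry standing in for the managed-function deployment step.
REGISTRY = {}

def udf(fn):
    """Pretend 'deployment': register the function under its name."""
    REGISTRY[fn.__name__] = fn
    return fn

@udf
def classify_length(text):
    # Stand-in for a sophisticated model call (e.g. a Hugging Face
    # pipeline hosted in Cloud Run).
    return "long" if len(text) > 5 else "short"

def apply_udf(column, udf_name):
    """Stand-in for .apply(): the engine would ship row batches to the
    deployed function; here we simply map locally."""
    fn = REGISTRY[udf_name]
    return [fn(value) for value in column]

labels = apply_udf(["hi", "a longer string"], "classify_length")
print(labels)  # ['short', 'long']
```

In the managed version, deployment, scaling, garbage collection, and billing happen behind the decorator rather than in the notebook.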

As you point out, we found APIs that were hard to implement on top of BigQuery. We want to cover them all, but we prioritize by crawling public git projects and notebooks and sorting the functions by the most used, and by listening to our customers.

BigFrames has averaged two releases per month, and sometimes we go in directions we were not expecting because our customers asked for them, like implementing more visualization compatibility. We were expecting users to do data preparation for AI training, and data exploration was a bit of a surprise. BigFrames went from “not good” to “pretty good” in that space over the last year.


Q2.  BigFrames claims support for 150+ pandas functions, which is impressive but still a fraction of pandas’ full API surface. What are the hardest categories of pandas operations to support at BigQuery scale?

More specifically:

  • Stateful operations: Pandas allows arbitrary Python code with mutable state across operations. How do you handle operations that fundamentally assume in-memory, row-by-row iteration when your execution model is distributed SQL?
  • Ordering semantics: BigQuery DataFrames 2.0 introduced “partial ordering” mode as an optimization. Can you explain the exact semantic differences between pandas’ strict ordering guarantees and BigFrames’ partial ordering? Under what conditions does this difference become user-visible, and how do you help data scientists understand when they can safely relax ordering for performance?
  • Lazy evaluation boundaries: Pandas is eagerly evaluated; BigFrames builds a query plan. When a user calls df.head() or to_pandas(), you materialize results. How do you manage the impedance mismatch where users expect immediate feedback but you’re optimizing for deferred execution? Have you seen cases where this lazy evaluation confused users or led to unexpected costs?

Ivan Santa Maria Filho: We currently cover 850 of the approximately 1,400 Pandas functions, depending on whether you count all the supported parameter types or not. 

Making ordering flexible is a very common design compromise for frameworks trying to make Pandas scale. For BigFrames we decided to let users choose the behavior they prefer. They can choose Pandas semantics with strict (consistent) ordering of rows, so calling an operator like “head()” multiple times will yield the same results every time, which requires the equivalent of an ORDER BY clause. This is expensive and, for complex indices, requires us to compute a column. If the user does not care about the ordering semantics, they can set a flag and BigFrames will avoid the ORDER BY operation. We also log warnings for all APIs that rely on implicit ordering and, of course, allow the user to suppress the warning.
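The semantic difference can be simulated without any warehouse at all. In this sketch, partitions of a "table" arrive in nondeterministic order, so an unsorted head() can change between runs, while sorting first (the ORDER BY equivalent) makes it stable. This is an illustration of the tradeoff, not BigFrames code.

```python
import random

rows = list(range(100))

def scan_without_order(seed):
    """Simulate a distributed scan: partitions arrive in arbitrary order."""
    partitions = [rows[i:i + 10] for i in range(0, 100, 10)]
    random.Random(seed).shuffle(partitions)
    return [r for part in partitions for r in part]

# Relaxed semantics: head(5) depends on partition arrival order.
relaxed_a = scan_without_order(seed=1)[:5]
relaxed_b = scan_without_order(seed=2)[:5]

# Strict (Pandas-like) semantics: sort first, so head(5) is stable --
# at the cost of ordering the whole result.
strict_a = sorted(scan_without_order(seed=1))[:5]
strict_b = sorted(scan_without_order(seed=2))[:5]

print(relaxed_a == relaxed_b)  # usually False: arrival order differs
print(strict_a == strict_b)    # True
```

On a 1PB table the `sorted()` step is where the cost hides, which is why relaxing it is worth a flag.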

In some cases the user will be able to see a computed column with the complex index, which can cause compatibility issues. If the user explicitly names the columns they want, they see nothing. If they do not, they see any computed column we add. 

The lazy evaluation is another interesting compromise. BigQuery runs on top of really big clusters, with tens of thousands of servers each. It is designed to run complex queries, and has an advanced optimizer. The reason we do lazy evaluation is because all Pandas APIs are transformed into an abstract syntax tree, and the actual operations are pending execution. A BigFrames data frame is a “promise” of a data frame – a name, and a pending log of operations. When we execute the operations, they are all combined by the optimizer. We might detect that a later filter would remove rows from an earlier operation and filter first.
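The "promise of a data frame" idea can be sketched in a few lines: operations append to a pending log, and a trivial "optimizer" pushes a later filter ahead of an earlier projection so less data is touched. The class and operation encoding are illustrative assumptions, not the real BigFrames plan representation.

```python
class LazyFrame:
    """Toy lazy frame: a name for data plus a pending log of operations."""

    def __init__(self, rows):
        self._rows = rows
        self._ops = []          # pending log, combined only at execute()

    def select(self, key):
        self._ops.append(("select", key))
        return self

    def filter(self, key, predicate):
        self._ops.append(("filter", key, predicate))
        return self

    def execute(self):
        # Trivial "optimizer": run all filters before any projection,
        # so a later filter can remove rows from an earlier operation.
        filters = [op for op in self._ops if op[0] == "filter"]
        selects = [op for op in self._ops if op[0] == "select"]
        rows = self._rows
        for _, key, pred in filters:
            rows = [r for r in rows if pred(r[key])]
        for _, key in selects:
            rows = [{key: r[key]} for r in rows]
        return rows


frame = LazyFrame([{"x": 1, "y": 10}, {"x": 5, "y": 50}])
result = frame.select("y").filter("x", lambda v: v > 2).execute()
print(result)  # [{'y': 50}]
```

Note the filter on “x” is declared after the projection to “y”; only because execution is deferred can the optimizer reorder them and still produce a correct answer.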

Map-reduce systems have always dealt with choices like “should we sort the data then hash it for a join, or should we hash, join then shuffle sort?”. By using lazy execution we give ourselves a chance to use the optimizations and save the user money and time. Depending on how the user is paying for BigQuery, the amount of scanned data matters for cost and we are, again, 100% focused on customers. The first version of BigFrames we shipped was too expensive, and today we are on par with SQL.

When it comes to stateful operations, we support them in two ways. The data frames in BigFrames are more of a promise of a data frame than an actual data frame. When reading data from BigQuery the data frame contains a reference to a server side snapshot of the table. When writing to BigQuery the append operations are kept local until enough changes accumulate and we flush them to a temp table, or the user does an operation that triggers the flush. The data frame also contains a log of pending transformations. The user can call execute() on the data frame and BigFrames will apply the transformations locally if possible, or just fetch the results, which will cause a global optimization of pending transformations and a server call. The server call might be a direct storage operation (read/write) or a SQL job.

We also support Python UDFs, and those can retain state themselves. When the user performs an “apply(function)” operation, the function might be a remote function, which supports full web applications as backend, or a Python Managed function. The user can, for instance, create a remote function that connects to Hugging Face, download a transformer, cache it offline, and expose an API call to BigQuery. We will only initialize the web application when we launch it or add new instances of it, but every call to the UDF will benefit from the state of the server. 


Q3. BigQuery’s UDF story has evolved from SQL/JavaScript UDFs that run in-process, to remote functions that call out to Cloud Functions, and now BigFrames 2.0 adds Python UDFs with a @udf decorator. Can you walk us through the architectural evolution and the limitations each approach addresses?

In particular:

  • Execution model tradeoffs: Running Python UDFs via Cloud Functions means network round-trips for every batch of rows. What’s the performance penalty in practice, and how do you amortize this cost through batching strategies? How large do result sets need to be before remote UDF overhead dominates total query time?
  • State management: Traditional UDFs can’t maintain state across invocations (by design, for parallelization). But data scientists often want to do things like “apply this pretrained ML model to every row” where loading the model once and reusing it would be far more efficient. How does BigFrames handle this? Can you cache model objects across UDF invocations, or does every batch reload from scratch?
  • Error handling and debugging: When a Python UDF crashes on row 4,782,391 of a 10-million-row table, how do data scientists debug this? What visibility do you provide into UDF execution, and how do you balance comprehensive logging with the cost/performance implications of collecting it at scale?
  • Security boundaries: Allowing arbitrary Python code to run is a massive security surface. How do you sandbox UDF execution to prevent: (a) accessing other customers’ data, (b) egress of sensitive data, (c) abuse of compute resources (crypto mining, etc.)?

Ivan Santa Maria Filho: I think it is important to say the UDFs are used by BigFrames, but users don’t need BigFrames to use them. They can declare and use them from SQL. We did not want to create a proprietary API for this, so we extended the public SQL API instead. This is a recurring theme for our team.

We expect the UDF space to evolve a lot in 2026 and 2027. BigQuery supports SQL UDFs, JavaScript UDFs, Remote Functions, and now Python managed UDFs. JS runs in a sandbox, which is itself inside a nested VM, running on the same set of machines as BigQuery workers. There is no network cost, but there are costs to launch the VM and inter-process costs too. For remote and managed UDFs we currently run them on Cloud Run, and we have the network costs. What we do for those is to batch rows to amortize costs, and we have invested a significant amount of time to make the serialization and deserialization costs low.

This might sound counter-intuitive, but the biggest performance problem is not the network. The biggest challenge for us is to teach the optimizer how long individual UDFs take to process a row, and how many parallel calls we should be making, with how many rows on each call. For our first iteration we will ask users to help us by setting core counts, RAM, and concurrency levels. We will give them telemetry and logging to let them make that call. Over time we want to watch the UDFs and adjust the settings automatically, but that will come later.

For your specific question, we support fairly complex UDFs. One of my first tests was to call Hugging Face from the UDF and set up a local pipeline (local to the UDF runtime, in Cloud Run). The UDF had two dozen Python functions defined, one to fetch my developer keys from our key service (KMS), another to take the key and download a text pipeline from hugging face, another to store the weights and setup a local cache, and so on. One of those Python functions was the UDF entry point.

When we instantiate the UDF, or auto-scale it by adding instances, we run the UDF body as if it was a main function in Python. I used that to setup the stateful model locally in the Cloud Run instance. When BigQuery calls the UDF, it calls the entry point function. You can find a similar example calling Google’s translation APIs – the client is instantiated only once.
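That lifecycle, body runs once per instance, entry point runs once per call, can be sketched with plain Python. `load_model` and `entry_point` are hypothetical names; the "model" is a stand-in for anything expensive to initialize, such as downloaded weights.

```python
init_count = 0

def load_model():
    """Expensive one-time setup: stand-in for fetching keys from KMS,
    downloading a pipeline, and priming a local cache."""
    global init_count
    init_count += 1
    return {"scale": 2}

# Module-level code runs when the instance starts (or when auto-scaling
# adds a new instance) -- NOT once per row.
MODEL = load_model()

def entry_point(value):
    """The per-call function: every invocation reuses the warm state."""
    return value * MODEL["scale"]

# Three "rows" arrive; the model is still loaded only once.
results = [entry_point(v) for v in [1, 2, 3]]
print(results)      # [2, 4, 6]
print(init_count)   # 1
```

Amortizing initialization this way is what makes patterns like "apply this pretrained model to every row" affordable.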

We are considering a Python UDF version that runs in the shard like the JavaScript UDF, but it will depend on customer demand.

Error handling with data frames and Python is one of the advantages this approach has over SQL. If the user calls a function per data frame row, they can assign the return code to another data frame column, then later use a filter to retry only the failed rows. SQL in general would force the user to retry the query again, which would run every row again. For example, let’s say you want to send emails to customers matching a given criteria using UDFs and SQL. Then assume that “SELECT send_email(customer_email) WHERE …” would select 10k users. If send_email fails for any of them, BigQuery would retry the entire job. The assumption of the SQL language is that send_email() has no side effects until the entire job is successful, which is very likely not true. This is a very easy way to spam customers. Using Python and “apply()”, the send_email UDF can return a pass/fail return code, and a simple while loop can retry only the failed rows using a filter. This is also doable in SQL, but it is hard enough that it makes for a good interview question.
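The retry-only-failed-rows pattern looks roughly like this, here with plain Python lists standing in for data frame columns and a deterministic, hypothetical `send_email` that fails on the first attempt for two addresses.

```python
# Addresses that fail exactly once -- a stand-in for transient errors.
FAILS_ONCE = {"user2@example.com", "user7@example.com"}
sent_log = []

def send_email(address):
    """Hypothetical flaky sender with a visible side effect (sent_log)."""
    sent_log.append(address)
    if address in FAILS_ONCE and sent_log.count(address) == 1:
        return "fail"
    return "ok"

emails = [f"user{i}@example.com" for i in range(10)]

# First pass: the 'apply' -- one status per row.
status = [send_email(e) for e in emails]

# Retry loop: resend ONLY the failed rows. Rows already 'ok' keep
# their status and, crucially, are not re-sent (no spam).
while "fail" in status:
    status = [s if s == "ok" else send_email(e)
              for e, s in zip(emails, status)]

print(status)        # all 'ok'
print(len(sent_log)) # 12: 10 first-pass sends + 2 retries
```

The SQL version would have re-invoked `send_email` for all 10 rows on retry; here the send log shows only the two failures were retried.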

Security is very important. Google enforces that all services and microservices have multiple security boundaries. For code running on the same machine as BigQuery processes, for example, user code runs in a sandbox, and the sandbox inside a gVisor VM. The gVisor VM has no IO stack and a very limited surface, and that is the public part of the solution. We have additional hardware, software, and network controls in place.

For managed Python you can safely assume we have at least the same mitigations in place, plus very robust monitoring, and we deploy the code to Cloud Run, which sits on another cluster using a restricted configuration. For functions running in Cloud Run it is possible to access the Internet, but the user has to specify a connection configuration, which includes a service account, grant that service account the correct permissions, and make sure the VPC settings in their project allow it. Only if the project is configured to have Internet access, and the UDF creator has the rights to create service accounts and connections plus permissions to access the Internet, is it possible to copy data outside Google. By default there is no Internet access, so the user has to do work to enable it.


Q4. You mentioned BigFrames would “certainly explain the limitations of BigQuery.” Let’s dig into that. What are the most significant BigQuery architectural decisions that constrain what BigFrames can do, and how do these manifest as surprising limitations for users?

For example:

  • Storage format constraints: BigQuery’s columnar storage and partitioning strategy presumably makes some pandas operations prohibitively expensive. What operations fall into this category? Are there pandas patterns that work fine on 10GB but break completely at 10TB due to BigQuery’s architecture?
  • Type system mismatches: Pandas supports Python’s dynamic typing; BigQuery has a strict schema. How do you handle cases where a pandas operation would dynamically change column types based on data content? Do you fail at query planning time, or try to infer schemas and potentially fail at execution time?
  • Result size limits: BigQuery DataFrames 2.0 changed allow_large_results to default to False, failing queries that return >10GB compressed data. This is a dramatic departure from pandas’ “it fits in RAM or it doesn’t” model. How do you help users understand when they’re bumping against this limit, and what patterns do you recommend for working around it (beyond just “set the flag to True”)?
  • Transaction semantics: Pandas DataFrames are just objects; mutations are immediate and in-memory. BigFrames operations compile to queries. What happens when users expect ACID transaction semantics (e.g., “update these 3 tables atomically”) but you’re generating separate SQL statements?

Ivan Santa Maria Filho: BigQuery is designed to support SQL, to scale to datasets with PBs of data, and to use highly optimized, controlled SQL engine operators. For what it was designed it works exceptionally well. When it comes to running arbitrary user code, I believe we could do much more.

Many choices get harder at scale. The simplest one to describe is supporting the implicit ordering of rows. If you have 1GB of data, dropping an index and computing a new one will take a couple of seconds. If you have 10TB that will take longer, maybe not linearly longer, but longer. There is no magical way to fix this problem.

We could pull a page from RDBMS design and use a B-Tree and clustering keys as storage, but BigQuery reads data from multiple partitions in parallel, and the data would return in random order. We could use a single partition for data frame storage, but that would limit scale and performance. It would also force a table rebuild when the index changes. We could use B-Trees and secondary indices to simulate a table scan. We could inject sort operators over a computed index column. Every option consumes time and raises the cost to our users.

We are offering the Pandas semantics by default, so users are not surprised, but also a mode more similar to what Polars and databases do. If our customers tell us this is acceptable, we would make it the default; otherwise we will continue to look for the best way to gain scale with the Pandas semantics.

The type mismatches are always a problem. Python uses duck-typing, but it also supports a very rich type system, with several Python libraries having their own data types, both simple and complex. BigQuery is strongly typed, so we cannot just pass the bytes around; we have to convert from what is stored in the BigQuery cells to something that makes sense in Python. Those conversions can be expensive, particularly if the user is applying a UDF to a column or data frame. The data will be in BigQuery and passed to the UDF row-wise or column-wise depending on the call syntax. The way that works is that BigQuery partitions the table holding the data frame data and sends each partition to a worker. This worker reads the data from our store and sends it to the worker hosting the UDF. We do what we can to optimize this step, but that does not change the fact that the data in the store is in a different encoding than what Python expects. Even timestamps have different resolutions in BigQuery.
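The timestamp-resolution point is easy to demonstrate with integer arithmetic: pandas timestamps carry nanoseconds, while BigQuery TIMESTAMP values are microsecond-resolution, so a round trip can silently truncate. The epoch value below is arbitrary.

```python
# Nanoseconds since the Unix epoch (pandas' native resolution).
ns_timestamp = 1_700_000_000_123_456_789

# Converting to microseconds (BigQuery TIMESTAMP resolution) drops
# the sub-microsecond digits.
us_timestamp = ns_timestamp // 1_000
back_to_ns = us_timestamp * 1_000

print(us_timestamp)               # 1700000000123456
print(ns_timestamp - back_to_ns)  # 789 nanoseconds lost in the round trip
```

Multiply a loss like this across billions of rows of encode/decode work and the conversion cost the answer describes becomes visible.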

The result set size limit has a dual purpose. Certain operations have no inherent limits other than BigQuery limits. Applying a UDF over rows will scale well, and because of that the user might not even realize they are scanning hundreds of TBs of data. That can become really expensive, and the only billing surprise we like is when the price is lower than expected. The size limit is an attempt to avoid bad surprises.

The other purpose is to avoid crashing a notebook. If the user tries to render 10GB of data points in a notebook widget, odds are that will crash the notebook. One unique problem with very large datasets and series is that one cannot just plot every point. They also cannot just naively sample the data because they might miss a maximum, minimum, or anomalous data point. We are considering adding decimation algorithms to reduce the granularity of the series but retain its shape, maybe building that into BigFrames, but ideally contributing this to an OSS project.
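One simple decimation of the kind described, chosen here for illustration and not taken from BigFrames, is to bucket the series and keep each bucket's minimum and maximum, so extremes and anomalies survive even though most points are dropped. (Naive uniform sampling would likely miss the spike.)

```python
def minmax_decimate(series, n_buckets):
    """Keep each bucket's min and max, preserving left-to-right order."""
    size = max(1, len(series) // n_buckets)
    out = []
    for i in range(0, len(series), size):
        bucket = series[i:i + size]
        lo, hi = min(bucket), max(bucket)
        # A set collapses lo and hi when they are equal; sort the
        # survivors back into their original order within the bucket.
        out.extend(sorted({lo, hi}, key=bucket.index))
    return out


# A mostly flat series with one anomalous spike.
series = [0.0] * 1000
series[637] = 42.0

decimated = minmax_decimate(series, n_buckets=20)
print(len(decimated))     # 21 points instead of 1000
print(42.0 in decimated)  # True: the spike survives decimation
```

Production-grade algorithms such as largest-triangle-three-buckets refine this idea, but min/max bucketing already shows why shape-preserving reduction beats naive sampling for plotting.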

As far as ACID semantics go, BigFrames does not support complex transaction boundaries. There is no way to express that changes to two data frames should both be committed or not committed. That said, for a single data frame BigFrames uses a “copy on mutate” approach, writing all changes to a new “backing table” and then linking the client object to the resulting table if everything goes right. We could investigate a way to have cross-data-frame transactions, but we never got that requirement.


Q5.  Looking forward, we’re seeing an explosion of “pandas-like” APIs: Dask, Modin, Polars, BigFrames, Snowpark Python, Databricks pandas API on Spark. Is the data science ecosystem converging toward pandas as a universal interface, or are we headed for fragmentation as each implementation adds vendor-specific extensions?

More philosophically:

  • API surface versioning: Pandas releases new versions regularly with API changes. How does BigFrames handle pandas version compatibility? Do you target a specific pandas version, or try to track the latest? What happens when pandas adds a feature you can’t support efficiently in BigQuery?
  • Beyond pandas: You mentioned that BigFrames 2.0 adds multimodal capabilities for unstructured data (images, text). Pandas wasn’t designed for this. At what point does extending the pandas API for new use cases become counterproductive, and you should just design a new API that’s purpose-built for distributed, multimodal data processing?
  • ML integration: BigFrames includes bigframes.ml with a scikit-learn-like API for BigQuery ML. But modern ML workflows involve PyTorch, TensorFlow, Hugging Face transformers, etc. How do you see the integration of these frameworks evolving? Will we see bigframes.torch or bigframes.transformers, or is there a fundamental mismatch between these frameworks’ execution models and BigQuery’s architecture?
  • Standards vs. ecosystems: Would the data science community benefit from a formal standard for “distributed dataframe APIs” (similar to how SQL standardized relational queries), or is the current Cambrian explosion of implementations actually healthy for innovation?

Ivan Santa Maria Filho: For API versioning, we follow the same model the OSS community does, with major and minor versions. We are expecting many large updates from Python and Pandas this year, and keeping up with the changes. 

My take is that the ecosystem will continue to fragment for a while, and that is not necessarily bad. We have enough innovation in this space that both clients and backends are evolving and have diverse feature sets. It is quite hard to offer a smooth, common surface across backends, without compromising performance and / or cost. By the time any industry gets to be fully standardized, that is usually the time it is also commoditized, and investment slows.

The BigQuery team added support for multi-modal data, auto-generation of embeddings, and auto-quantization of models, making extraction and inferencing way cheaper. Most data in enterprises everywhere is not structured. The amount of data stored in documents, intranet pages, email, calendars, and collaboration / chat tools is way higher than data curated in tables. 

I don’t see the point of hiding this functionality from customers, but I also don’t want to pollute the Pandas API namespace. We try to be as explicit as possible, so users know what is, and what is not a Pandas default API, but we make our extensions interoperable. 

For example, it is fairly easy to perform sentiment analysis on a support phone call audio recording, then join the sentiment and user data in BigQuery so a CRM application can track how happy the customer was, and what were the issues they cared about.

It is getting increasingly easy to instruct an agent to watch the general sentiment around a product and only warn us when something changes. 

The development around agents makes it harder to predict the future of Pandas-like frameworks. Given the current investment level, fragmentation is a natural evolution of this space, but if we achieve an agentic solution that produces results by answering questions in English, the mechanisms to handle data will be less popular.

The agents themselves will need a language to express what they want, but the number of direct active users might go down drastically. We might finally end up with something similar to the Star Trek Enterprise computer, and at that point I just don’t see a regular data scientist or business analyst writing Python directly. 

…………………………………………………………………………

Ivan Santa Maria Filho has a BSc and MSc in computer science and a wide variety of experiences as individual contributor and manager, having owned a small software company and worked on multiple billion dollar products and services at Microsoft, Meta and Google. His main areas of expertise include vertical integration of stateful, large scale services with ephemeral VM infrastructure, and the infrastructure itself.




Additional Context for ODBMS.org Readers:

What is BigFrames? BigQuery DataFrames (BigFrames) is an open-source Python library that provides a pandas-compatible API for analyzing data stored in BigQuery. Unlike pandas, which loads data into local memory, BigFrames translates operations into BigQuery SQL, enabling data scientists to work with terabyte-scale datasets using familiar pandas syntax.

Why does this matter? Most data scientists learn pandas, but pandas doesn’t scale beyond single-machine memory limits. BigFrames (and competitors like Databricks pandas API, Snowpark Python) represent a new generation of tools that preserve familiar APIs while transparently distributing computation. Understanding the tradeoffs in these systems helps organizations choose the right tools and helps researchers understand the limits of API compatibility.

Key Technical Innovation: BigFrames uses a transpilation approach: pandas operations → Ibis intermediate representation → SQLGlot SQL generation → BigQuery execution. This allows Google to avoid directly bundling pandas code while maintaining API compatibility – a fascinating case study in software architecture and licensing strategy.
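To give a flavor of what "transpilation" means here, the following toy compiler turns a small operation log (a stand-in for the intermediate representation) into a SQL string. The real pipeline through Ibis and SQLGlot is vastly richer; the op encoding and table name below are illustrative assumptions.

```python
def compile_to_sql(table, ops):
    """Compile a toy op log into a single SQL statement."""
    columns, predicates, limit = "*", [], None
    for op in ops:
        if op[0] == "select":          # projection -> SELECT list
            columns = ", ".join(op[1])
        elif op[0] == "filter":        # filters -> WHERE conjuncts
            predicates.append(op[1])
        elif op[0] == "head":          # head(n) -> LIMIT n
            limit = op[1]
    sql = f"SELECT {columns} FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql


# df[["user_id", "ts"]][df.ts > "2026-01-01"].head(5), roughly.
sql = compile_to_sql(
    "my_dataset.events",
    [("select", ["user_id", "ts"]),
     ("filter", "ts > '2026-01-01'"),
     ("head", 5)],
)
print(sql)
# SELECT user_id, ts FROM my_dataset.events WHERE ts > '2026-01-01' LIMIT 5
```

Because the whole chain of pandas-style calls collapses into one statement, the warehouse's optimizer sees the complete plan at once instead of one eager operation at a time.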

……………………………..


Feb 9 26

On AI and the Future of Rail Systems: Interview with Roland Edel

by Roberto V. Zicari

“AI reshapes rail jobs by reducing repetitive tasks and giving staff more responsibility for decision‑making. It also enables engineers and project teams to focus more on innovative and creative work, as well as to deliver complex rail projects on time and on budget. Technicians work increasingly data‑driven, dispatchers make better‑informed decisions, and drivers gradually move into supervisory roles for automated systems.”

Q1. As CTO of Siemens Mobility, you oversee one of the world’s most critical transportation infrastructure portfolios. When you look at the global rail industry today, where do you see AI and advanced algorithms creating the most transformative opportunities—not just for operational efficiency, but for fundamentally reimagining how rail systems serve cities and nations? What convinced you that AI was no longer optional but essential for the future of mobility?

Roland Edel: Data and Artificial Intelligence already make rail transport faster, more stable and more reliable—often without passengers even noticing. Today, AI detects early deviations in vehicles and infrastructure, analyses camera data and prevents disruptions before they materialize.

The next major step in the long run is Driverless Train Operations (DTO) with a Grade of Automation (GoA) 3 in mainline operations. In earlier projects such as BerDiBa and safe.trAIn, we developed foundational technologies that we are now applying in current projects like R2DATO and RemODtrAIn. Here, we are shaping the transition from semi‑automated operations (GoA2), including our ATO over ETCS project with S‑Bahn Hamburg, to fully automated operations (GoA4) or remote operations in stabling areas.

This requires close integration of onboard intelligence, sensors, digital infrastructure and signalling. These technologies lay the foundation for a system that can scale reliably even as demand grows.

For me, the turning point in our automation projects came when data on optimized train planning and energy savings made one thing unmistakably clear: analytics, algorithms and AI deliver tangible operational benefits—from more efficient planning to reduced energy consumption and more stable performance.

Q2. Many industries struggle to move AI initiatives from successful pilot programs to enterprise‑wide implementation. Rail systems are particularly complex—they involve safety‑critical operations, legacy infrastructure, multiple stakeholders, and regulatory frameworks that prioritize reliability above all else. What have been the biggest organizational and operational challenges you’ve encountered in scaling AI applications across Siemens Mobility’s rail portfolio, and how have you approached the tension between innovation and the rail industry’s paramount focus on safety?

Roland Edel: Scaling AI in the rail domain works only if we are able to incorporate safety‑critical functions into our innovations. Safety logic remains deterministic and certified; AI is added only where it is fully verifiable. Deployment follows a stepwise approach: first in depots, then in shunting areas, and later on the mainline.

Projects such as AutomatedTrain and others, in which we collaborate closely with an ecosystem of external partners, demonstrate how essential robust error detection and sensor fusion are for ensuring safe perception in open environments. At the same time, modern tools allow us to update safety‑relevant software during ongoing operations, keeping systems updated without compromising availability.

This combination—clear boundaries, strong diagnostics and incremental rollout—has proven to be the right way to balance innovation with the industry’s uncompromising safety culture. Finally, it all comes down to people: we can only scale AI when we train our employees accordingly and embed data and AI into all our processes.

Q3. AI is only as good as the data it learns from. Rail systems generate enormous amounts of operational data, but often in silos. From a leadership perspective, what does it take to build the data infrastructure that makes AI in rail reliable? How do you convince diverse stakeholders to share and standardize data?

Roland Edel: Trustworthy AI requires trustworthy data across the entire lifecycle of a rail system. That is why we increasingly rely on digital twins that connect design, engineering, manufacturing, operations and servicing. From the first CAD model to condition‑based maintenance and real‑time operations, a digital twin ensures that data remains consistent, interoperable and available wherever it is needed.

Open interfaces, standardized data models and federated platforms make this possible in practice. Our Railigent X suite plays a central role by integrating engineering data, vehicle data, infrastructure information and operational insights, while keeping operators in full control of their data.

When lifecycle data becomes interoperable, system availability improves, analytics become more precise, and the entire network operates more reliably and economically. And this is where stakeholders become convinced: when real projects demonstrate better services, higher reliability, improved cost structures and full data sovereignty. Once these benefits are visible, data collaboration stops being a hurdle and becomes an accelerator for innovation.

Q4. Predictive maintenance is often cited as AI’s ‘killer application.’ What is the realistic business case, and what has surprised you most about what it takes to make it work?

Roland Edel: Predictive maintenance delivers measurable business value: higher availability, reduced lifecycle costs and more efficient maintenance planning. AI uncovers patterns that humans cannot detect and enables precisely timed interventions.

What surprised me most was that cultural change often matters more than the algorithms themselves. Teams need to take into account the predictions, understand their implications and adapt work processes accordingly. Financially, the payoff is significant but requires patience—it is a long‑term investment.

The next step is what we call Predictive Availability, where entire functional chains—not just single components—remain stable. This includes linking data from incident reports, diagnostics, measurements, visual inspections and operational context into one lifecycle digital twin. This system understanding allows AI to anticipate disruptions earlier and more reliably.

The approach works well already, but its full potential depends on even closer collaboration across the ecosystem.

Q5. The rail industry is exploring different levels of automation. What framework do you use to decide what to automate first, and how do you balance safety, public trust and workforce concerns?

Roland Edel: We automate according to a clear framework: start where the environment is controlled and the benefits are greatest. Depots are ideal—they offer structured, repeatable processes with high potential for efficiency gains. Automation then moves to stabling and shunting yards, supported by AI‑driven obstacle detection and remote operation. From there, automation can be extended progressively.

At the same time, the human role remains central. Rare, complex edge cases are still best handled by experienced staff, so automation supports people rather than replaces them. Public trust grows when the benefits (greater safety, greater punctuality, fewer routine tasks) are transparent and when rollout is gradual. Each phase builds experience and confidence for the next.

Q6. Rail is already energy efficient. How big is AI’s role in sustainability, and how do you manage trade-offs?

Roland Edel: AI is one of the strongest levers for energy efficiency in rail transport. Automated driving profiles reduce energy consumption, maximize regenerative braking and minimize wear. AI‑based timetable optimization smooths traffic flows and prevents unnecessary stop‑and‑go patterns. To unlock these benefits across the entire network, data from vehicles, infrastructure and operations must be integrated. That is why we have introduced Siemens Xcelerator principles across our portfolio—Railigent X, Signaling X and the Mobility Software Suite X—to create modular cloud‑based software, interoperable APIs and an open ecosystem. Trade‑offs between energy efficiency and service frequency can be managed intelligently: AI enables the optimization of both simultaneously by balancing demand, capacity and operational constraints in real time.

Q7. AI and automation raise important questions about the future of work in rail. How do you approach workforce concerns, and what skills will be needed?

Roland Edel: AI reshapes rail jobs by reducing repetitive tasks and giving staff more responsibility for decision‑making. It also enables engineers and project teams to focus more on innovative and creative work, as well as to deliver complex rail projects on time and on budget. Technicians work increasingly data‑driven, dispatchers make better‑informed decisions, and drivers gradually move into supervisory roles for automated systems.

To support this shift, we invest in targeted training: digital learning platforms, simulation environments and hands‑on programs that build confidence in new tools. AI does not eliminate jobs; it modernizes them, creating more attractive, safer roles with clearer career perspectives.

Q8. Rail is heavily regulated. How do you work with regulators to build confidence in AI, and how do you earn public trust?

Roland Edel: Regulators are rightly accustomed to deterministic, fully explainable systems. We therefore involve them early—long before an AI‑based function enters the approval process. Together with our partner ecosystem, we develop methods to make AI systems traceable, testable and auditable, including virtual testbeds, robust perception validation and hybrid architectures that ensure safety‑critical logic remains reliable and predictable.

The overall system must remain predictable, and every AI‑supported decision must stay within defined boundaries. Continuous monitoring is essential: sensors and algorithms must detect when they deviate from expected performance and transition into safe states. Public trust grows through transparency, real‑world performance and a phased introduction—starting in controlled environments like depots and only later in passenger service.

Q9. Looking ahead to 2030, what does a realistic AI‑enabled rail system look like? And what challenges keep you up at night?

Roland Edel: By 2030, AI will be an almost invisible yet essential part of rail operations. Passengers will benefit from more reliable services, clearer information and smoother journeys. Data and AI will also enable highly personalized mobility services—from multimodal Mobility‑as‑a‑Service offerings to AI‑powered travel companions that proactively guide passengers throughout their journey.

Operators will rely on cloud‑based signaling, automated depots, predictive maintenance and digital supply chains. The system will become more resilient, flexible and climate‑friendly, and new applications will emerge. Three challenges remain. First, regulation and standards must evolve quickly enough to keep pace with innovation while maintaining safety. Second, the industry needs broader data and architecture harmonization across operators, suppliers and infrastructure owners. Third, workforce transformation must accelerate to align skills with new technologies.

To shape the Data & AI transformation in rail, we must open our data and platforms, modularize software, build digital twins and trustworthy industrial AI, strengthen ecosystem partnerships and accelerate deployment with confidence and purpose.

………………………………………………………………………………………………………

Siemens Erlangen ROLAND EDEL

Roland Edel has been Chief Technology Officer and Head of Technology & Innovation at Siemens AG’s Mobility & Logistics Division in Munich since 2011. Since October 2014, the Division has operated under the name Mobility.

After joining Siemens AG in Erlangen in 1993 as a design and development engineer at Transportation Systems, Roland Edel went on to assume various managerial roles within the former Electrification Division between 1996 and 2003. From 2003 onwards he was responsible for Engineering, Development and Product Management within the Business Unit Rail Electrification for five years. Roland Edel subsequently took charge of engineering and development within the newly formed Business Unit Turnkey, Electrification and Transrapid in Erlangen, before moving on to assume the position of Chief Technology Officer and Head of Innovative Mobility Solutions in the Business Unit Complete Transportation in 2009.

Resources:

Digital Transformation for Rail, Siemens Mobility.

……………………………..

Follow us on X

Follow us on LinkedIn

Nov 25 25

Twenty Years of Conversations: Reflections on Technology and Society

by Roberto V. Zicari

By Roberto V. Zicari, Editor, ODBMS.org

“Because ultimately, what these twenty years of dialogue have taught me is that technology is never just about the technology. It’s about us, and the world we choose to build together.”

When I launched ODBMS.org in 2005, the technology landscape looked remarkably different. Object databases were the conversation. SQL versus NoSQL was a heated debate. The cloud was still a meteorological term for most developers. Twenty years and hundreds of interviews later, what strikes me most isn’t just how much technology has changed, but how profoundly it has reshaped the questions we ask.

In those early years, our conversations centered on technical elegance—data models, query optimization, transactional consistency. We debated whether object-relational mapping would bridge two worlds or create new complexities. These were important questions, but they were questions about technology itself.

Today’s conversations reveal a different world. When I interview leaders now, we discuss trust frameworks for AI in clinical care, the societal implications of real-time data streams that move billions of dollars in milliseconds, the responsibility that comes with systems that make life-or-death healthcare decisions. The technology hasn’t just gotten faster or more powerful—it has become deeply embedded in the fabric of human decision-making.

This evolution reflects something fundamental: we’ve moved from asking “Can we build this?” to asking “Should we build this?” and “What happens when we do?” The practitioners I’ve spoken with over two decades—from Vinton Cerf discussing internet governance to recent conversations about AI ethics and trustworthy systems—increasingly grapple with questions that transcend engineering.

The patterns that emerge from twenty years of dialogue are striking. First, the acceleration is real and relentless. A database professional from 2004 measuring latency in hundreds of milliseconds would be stunned by today’s nanosecond-level systems. But speed alone tells an incomplete story. What matters more is the expanding scope of impact. Systems that once managed business transactions now influence medical treatments, shape financial markets, and mediate human knowledge.

Second, every technological breakthrough creates new responsibilities. The Big Data revolution promised insights; it delivered privacy challenges. Cloud computing promised accessibility; it raised questions about data sovereignty. Generative AI promises creativity; it demands frameworks for attribution, bias, and trust. Each wave of innovation brings not just solutions but new ethical territories to navigate.

Third, the gap between possibility and wisdom persists. We can build systems of remarkable sophistication, yet we struggle with governance, interpretability, and equitable access. The technical challenges we once obsessed over—scalability, performance, reliability—now seem almost quaint compared to the societal challenges of ensuring technology serves humanity rather than destabilizing it.

Perhaps most significantly, I’ve watched the democratization of technology amplify both its potential and its risks. Open source movements have accelerated innovation beyond what any single corporation could achieve. Yet this same openness means that powerful capabilities spread faster than our collective wisdom about their use.

Looking back through twenty years of expert articles and interviews, I see an arc from technical optimism to responsible pragmatism. The pioneers I spoke with in 2005 were building the future with enthusiasm and relatively few constraints. Today’s innovators build with one eye on capability and another on consequence. They think not just about systems that work, but about systems that work for society.

The database and data management community has always been at the intersection of possibility and reality. We store, structure, and serve the information that powers decisions. Now, as that information flows through AI systems and influences outcomes at unprecedented scale, our responsibility extends beyond technical excellence to social awareness.

As ODBMS.org enters its third decade, we are more committed than ever to addressing these pressing issues head-on. The portal has evolved to tackle the urgent questions emerging from the generative AI era—questions about trustworthy AI systems, responsible deployment, bias and fairness, data provenance, and the governance frameworks needed for AI in critical domains like healthcare and finance. Our conversations now explore not just how these systems work, but how we ensure they work ethically and equitably.

The core mission remains: to create a space where practitioners, researchers, and leaders can share not just their technical insights, but their wisdom about building technology that serves human flourishing. In this new era of generative AI, that mission has never been more vital. Because ultimately, what these twenty years of dialogue have taught me is that technology is never just about the technology. It’s about us, and the world we choose to build together.

Nov 10 25

Community Over Code: Ruth Suehle on Leading The Apache Software Foundation into the Future

by Roberto V. Zicari

“Open communication, consensus, and collaboration are the heart of The Apache Way and always have been. That’s why you hear us say ‘community over code.’”

Foundation Mission & Leadership

Q1. As President of The Apache Software Foundation, you’re leading one of the world’s most influential open-source organizations at a particularly dynamic moment in technology history. Can you share your vision for ASF’s mission today and how it has evolved? What does “The Apache Way”—the foundation’s collaborative, consensus-driven approach to software development—mean in 2025, and why do you believe this methodology remains vital as the software landscape becomes increasingly complex and commercially driven?

Ruth Suehle: The ASF has been around for more than 25 years, which has given us a lot of time developing software collaboratively, and plenty of lessons learned along the way. The Apache Way is the name for our time-tested approach to open source development, but it’s not a set of policies or demands. We have hundreds of projects, each with their own culture, activities, and stage of development. As a whole, however, the ASF’s long-held belief is that open source software thrives best when it remains independent of any single or dominant commercial interest. The Apache Way gives all of those diverse projects a framework for maintaining neutrality and independence. This ensures that our projects serve the broader community.

It’s built around a few concepts, the first of which leads the rest, and that is earned authority. The ASF is built on a web of trust and publicly earned merit, which does not expire. The community is entirely volunteer-based (though of course many are paid by companies to work on projects housed at The ASF, as they are for any code-producing foundation), and votes are all equal. 

Open communication, consensus, and collaboration are the heart of The Apache Way and always have been. That’s why you hear us say “community over code.” A strong and healthy community comes first, because a good community can fix bad code, but good code can’t heal a struggling community.

Q2. The Apache Software Foundation oversees hundreds of projects spanning everything from web servers to big data platforms to AI/ML frameworks. Looking across this diverse portfolio, what are the common threads or emerging patterns you’re seeing? Are there specific technical domains or project types where you’re seeing the most energy, innovation, or community growth? And conversely, are there areas where ASF projects face particular sustainability or relevance challenges?

Ruth Suehle: We actually map projects by category at projects.apache.org, so anyone is welcome to take a look and see where things lie today. What you mostly won’t see reflected there, however, are our projects in the Incubator, which is how new projects come into the foundation. The newest things there at any given time are likely to be reflections of broader trends in technology, and right now the latest additions are largely data-related.

It’s worth noting the other end of the lifecycle, as well: the Apache Attic. This is how we officially retire and archive projects, and it’s an important feature for the foundation and how we support a full project lifecycle. By ensuring transparency and providing a formal process for projects that are no longer under active development, the Attic acts as a historical archive: it moves projects to a read-only state to preserve their code and documentation for users, while ceasing new development and providing limited oversight to allow for future maintenance if needed.

As for sustainability, I see this not as an ASF challenge or that of a particular project, but as a difficulty facing the entire open source ecosystem right now. I’ve given talks and led panels at a few events in the last year on the subject. It was a significant topic at this year’s Open Source Congress. When you say “sustainability,” people tend to hear “funding,” and that is an important factor, but it’s more complicated than just money. That said, complying with coming regulatory changes, notably the Cyber Resilience Act (CRA), is going to impose significant additional costs on open source projects and foundations. This year we launched our Tooling Initiative to address those concerns, and it’s the first of our ASF Initiatives, which offer targeted sponsorships for specific needs.

Current Projects & Strategic Directions

Q3. Apache has been foundational to the big data revolution with projects like Hadoop, Spark, Kafka, and Flink. As we move into the GenAI era, how are these established projects evolving to serve new workloads and use cases? Are you seeing Apache projects positioning themselves as critical infrastructure for AI applications—for instance, in data pipelines feeding LLMs, vector databases, or real-time inference systems? What role do you envision Apache projects playing in the broader AI infrastructure stack?

Ruth Suehle: Apache projects are not just evolving for the GenAI era—they are actively positioning themselves as critical infrastructure for AI applications, particularly in the domain of data pipelines, real-time context, and orchestration. The shift is from “batch big data” to “real-time, contextualized data streams” that feed LLMs and power real-time inference.

As you state, existing ASF projects are already well-positioned to plug right into the AI ecosystem. Apache Kafka can act as a mission-critical data fabric for generative AI applications, while Apache Flink’s focus on stateful, low-latency, and event-time stream processing is ideal for AI workflows. Apache Spark, Apache Airflow, and Apache Beam all fit well as tools to manage tasks like large-scale data preparation, workflow orchestration, and data abstraction. In 2023, Apache Pinot added support for real-time vector ingestion to enable similarity search as a real-time operation, addressing the need for immediate updates in generative AI pipelines. So Apache projects are not just migrating their existing functionality; they are fundamentally being adapted to own the data layer within AI infrastructure stacks.

Q4. Beyond the well-known flagship projects, what are some emerging or underappreciated Apache projects that you’re particularly excited about? Are there incubating projects or recent graduates from the Apache Incubator that you believe represent important directions for the foundation? What makes these projects significant, and what do they tell us about where the Apache community sees future opportunities?

Ruth Suehle: I can’t even pick favorite songs and movies, much less favorite projects! But seriously, this question is more like picking which of your children you think is the most promising. A huge part of our underlying ethos and governance at the ASF is supporting all projects equally and encouraging all of our projects to be as successful as possible. Their independence and unique communities, coupled with the incredible innovation we tend to see across all open source projects, means that any of our Incubator projects have the potential to bring significant innovation and advancement in their areas. 

Q5. As President, what specific directions would you personally like to move The Apache Software Foundation forward? Are there strategic initiatives—whether technical, organizational, or community-focused—that you’re championing? This could range from attracting new types of projects, expanding global community participation, improving project sustainability models, or addressing gaps in the open-source ecosystem that ASF is uniquely positioned to fill.

Ruth Suehle: I mentioned earlier that when people hear “sustainability,” they often hear “money,” but it means other things as well. Fundamentally, sustainability is “what do we need to do to ensure the success of the open source ecosystem for decades to come?” One of the biggest changes I’ve seen in the last two or three years is a highly beneficial one, and that is a move towards more collaboration across the foundations, industry, and project communities. These groups have spent many years working largely as silos, which was fine when the work was all about individual software projects, but we’re facing more and more issues that are best solved by doing the thing that we all know best: collaboration. For The ASF, participating in groups like the Eclipse Foundation’s Open Regulatory Compliance Working Group, in our role as Open Source Initiative Affiliate members, and through partnerships such as ours with Alpha-Omega helps us reach solutions to common problems the open source way instead of constantly reinventing the wheel. Earlier this year, I was elected to the OSI board to represent the OSI’s Affiliate members, and I think the OSI’s work to bring together organizations through the Affiliate program and things like the Open Policy Alliance are great examples of this kind of cooperation that is not only the way forward for the entire ecosystem, but critical to continued success.

Another important piece of change we need for sustainability is doing a better job of growing a talent pipeline in open source. “Open source” got a lot of mainstream press for about 3 years after the term was coined in 1998, and then we all rather quietly built this massive ecosystem, again largely in silos. In 2025, that code is quite literally running the world, and there’s a lot more of it than there used to be. There are larger needs around it than there used to be. But the pool of maintainers has not grown at the same rate, and one place I think we really failed in all of open source is making sure we were bringing in new talent to keep up with the pace at which we were creating. We have plenty of room for improvement in preparing the next generation, and we have to keep building our people.

Simply put, we have a mentorship problem. I believe a large reason for that is that those who built open source software in the early years were doing exactly that–building from scratch. They may have had mentors in writing code, but they didn’t have mentors in open source, because they were writing the playbook as they went. As a result, they also didn’t have mentors in mentoring, i.e., a model to look to when mentoring the next generation of open source contributors. There are still a lot of folks around who have been here since roughly 1998, when the term “open source” was coined, or shortly thereafter. I don’t like the math, but the fact is that those people are retiring (or at least might like to one day!), and when I look around the room at events and on mailing lists, I’m not seeing enough new faces to keep up.

Future Vision & Community

Q6. Looking ahead three to five years, what does success look like for The Apache Software Foundation under your leadership? How do you want the foundation to be positioned relative to the major technological shifts we’re experiencing—not just GenAI, but also cloud-native architectures, edge computing, quantum computing, and emerging regulatory frameworks around software supply chain security and AI governance? What legacy or impact do you hope to achieve during your time as President, and what would you say to technologists, organizations, or students who are considering getting involved with Apache projects or The Apache Way of building software?

Ruth Suehle: There are a few important things coming in the next few years, and none of them are about specific technologies. New technologies are exciting, of course, but part of the reason they’re exciting is because they come and go. So the best thing we can do as a foundation is provide a solid structure for any project to build a community and a healthy open source project. We also need to keep making the technical improvements that will help them and their users, like the work we’re doing to build a foundation-wide release process and tooling infrastructure that enable ASF projects and incoming Incubator projects to fully comply not only with the CRA, but with all of the new regulations developing around the world.

If it’s not already obvious, the best thing I think The ASF can do, and the best way I can help as president, is to set an example for how to build good communities, both within our own foundation and in our collaboration with others. And the best thing that anyone who cares about the future of open source can do right this minute is not to write more code (which we’ll keep doing anyway), but to go find another person and turn them into a contributor, keeping in mind that the ecosystem is now vast and needs a lot more variety of skills than just writing code. For my part, I am always happy to share what I know, because hoarding knowledge helps no one. I frequently end talks by telling people that if there’s anything I know that can help you, whether that’s finding ways to contribute, learning how to bring your project into The ASF, starting an OSPO, or even making stellar baked goods, please reach out, and that goes for any reader here. Community over code thrives with each one of us building a little more community (and baked goods certainly never hurt!).

……………………………………………..

Ruth Suehle is the director of the open source program office at SAS, an analytics, data management, and AI software company. She is also president of the Apache Software Foundation and a member of the Open Source Initiative (OSI) board of directors. Ruth has helped build open source communities for nearly two decades, much of which she spent in the Open Source Program Office at Red Hat. Co-author of Raspberry Pi Hacks (O’Reilly, December 2013) and previously editor of Red Hat Magazine and opensource.com, Ruth is a writer and core contributor at GeekMom.com.

……………………………..


Nov 4 25

On Database Query Performance in HeatWave and MySQL. Interview with Kaan Kara 

by Roberto V. Zicari

“Of course, in practice, no query optimizer is perfect and there will be edge cases where the way a query is written will impact its performance.”

Q1. What are your current responsibilities as Principal Member of Technical Staff?

Kaan Kara: I am contributing as the tech lead for query execution in HeatWave. My main responsibilities are implementing new features in HeatWave, maintaining its stability, and supporting our customers with their HeatWave-related use cases.

Q2. Let’s talk about improving database query execution time. The way a query is written has a massive impact on its performance, and developers often face hurdles in structuring queries optimally. What is your take on this?

Kaan Kara: SQL is a declarative language. That means, in ideal terms, the database optimizer should produce the best query plan possible to answer the query, no matter how it is written. So, there should not be a need to optimize queries at the SQL level. This is what we strive for when designing optimizers. Of course, in practice, no query optimizer is perfect and there will be edge cases where the way a query is written will impact its performance. I believe there are two practical ways a database service can help address this. The first approach is providing insights into the query plan and its execution. Our goal is to offer detailed and understandable insights about the query plan to our customers, so that they can see where the bottlenecks are (for more information, please click here and here).
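The insight workflow Kaan describes can be sketched in miniature with SQLite from Python's standard library (a hedged stand-in: HeatWave's actual plan tooling and output format are different, and the table and index names below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.execute("CREATE INDEX idx_customer ON orders (customer_id)")

# Ask the engine how it intends to execute the query before running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchall()
for row in plan:
    print(row)
```

The last column of each plan row is a human-readable step; seeing a full table SCAN where a SEARCH on `idx_customer` was expected is exactly the kind of bottleneck signal a user would act on.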

Once they see the bottleneck, they can think about how the query can be rewritten, whether certain optimizer hints could help, and so on. Secondly, it is important that the database itself provides alternative execution schemes or user-guided optimization methods. For instance, we recently introduced materialized temporary tables in HeatWave. Once the user sees that a certain query subtree is taking a long time, they can decide to create a materialized view on it, substantially accelerating their queries.
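HeatWave's materialized temporary tables have their own syntax and machinery, but the underlying idea (compute an expensive shared subtree once, then let later queries read the stored result) can be sketched with SQLite; the schema below is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 10.0), ("EMEA", 5.0), ("APAC", 7.0)],
)

# Suppose this aggregation is the slow subtree shared by many queries:
# materialize it once instead of recomputing it per query.
conn.execute(
    """CREATE TEMP TABLE region_totals AS
       SELECT region, SUM(amount) AS total FROM sales GROUP BY region"""
)

# Later queries read the precomputed result directly.
rows = conn.execute(
    "SELECT total FROM region_totals WHERE region = 'EMEA'"
).fetchall()
print(rows)  # [(15.0,)]
```

The trade-off is the usual one for materialization: faster reads in exchange for storage and the responsibility of refreshing the result when the base data changes.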

Q3. Indexing is the most common and effective way to speed up queries. What are the major sources of challenges developers face?

Kaan Kara: Indexes come with a maintenance cost, and they are often used without proper analysis of the trade-offs between that cost and the performance benefit they provide. HeatWave, with its in-memory columnar data architecture, helps eliminate the need for most indexing in analytical workloads. However, there are certain use cases where indexes provide value. One example is vector embedding-based nearest-neighbor search, where index-based lookup is needed to ensure low response times. After introducing the native VECTOR type last year, we recently introduced VECTOR-based indexing in HeatWave, enabling our customers to run approximate nearest-neighbor search queries up to two orders of magnitude faster. One interesting direction we took was that we did not want to sacrifice result fidelity. We employ a novel method that utilizes the index only when we believe the results it produces will be accurate.
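HeatWave's vector index internals aren't described here, but the fidelity question is easy to state: an approximate index is measured against exact nearest-neighbor search, which for a small collection is just a linear scan. A minimal exact baseline in plain Python (the embeddings are made up for illustration):

```python
import math

def nearest(query, vectors):
    """Return the index of the exact nearest vector by Euclidean distance.
    An ANN index trades some of this exactness for speed at large scale."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(vectors)), key=lambda i: dist(query, vectors[i]))

embeddings = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1)]
print(nearest((1.0, 1.0), embeddings))  # 1, an exact match
```

A fidelity-preserving strategy like the one Kaan alludes to would consult the approximate index first and fall back to an exact scan like this when the index's answer cannot be trusted.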

Q4. Sometimes, the problem isn’t the query itself but the foundation it’s built on. Can you share your experience with this?

Kaan Kara : That is a very good point. Schema design plays a critical role in performance optimization. In some use cases, we see queries with predicates based on complex string operations or regular expressions, which make the query much slower than if the same predicate were applied to numeric columns. But this ties back to the ease of use and declarative nature of interacting with databases. Ideally, the user should not have to worry about these things and should do the most convenient thing, and the database should take care of optimizations behind the scenes.
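A toy illustration of the point, using Python's built-in sqlite3 rather than HeatWave (the schema and values are invented for the example): the same filter can be expressed as a pattern match on a string column or as a comparison on a numeric column. Both return identical rows, but the string predicate forces per-row pattern matching while the numeric one is a cheap integer comparison:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status_text TEXT, status_code INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "status=SHIPPED", 2), (2, "status=PENDING", 1), (3, "status=SHIPPED", 2)],
)

# String predicate: a pattern match evaluated against every row.
slow = conn.execute("SELECT id FROM orders WHERE status_text LIKE '%SHIPPED%'").fetchall()

# Equivalent numeric predicate: a simple integer comparison.
fast = conn.execute("SELECT id FROM orders WHERE status_code = 2").fetchall()

print(slow == fast)  # same rows, very different per-row cost at scale
```

At a few rows the difference is invisible; over billions of rows, the per-row cost of the string predicate dominates, which is why schema choices like encoding statuses as integers matter.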

In HeatWave, we strive to achieve this goal guided by real-world use cases from our customers. For example, we often observe read-heavy workloads repeatedly running the same expensive query subtree. To address this, we are developing an automated result cache that can materialize this subtree result within HeatWave and use it later when it is needed. We believe this feature will significantly improve query performance in many scenarios.

Q5. In a real-world application, a query doesn’t run in isolation. The performance of MySQL is heavily dependent on its configuration. What are your recommendations here?

Kaan Kara : That is true. Thankfully, we have a set of features in our Autopilot suite which eliminate much of the configuration guesswork. For instance, depending on the user's data and sample queries, Autopilot suggests the correct cluster size, data placement key, appropriate column encodings, and much more. But configuration is usually not a one-and-done affair: users' data and queries change over time. So it is also crucial to consistently provide detailed insights into the system, so that adjustments can be made. An example is the need for efficient compute up- and down-scaling. Some customers require more compute in their peak operating hours for faster queries. In HeatWave, we provide zero-downtime compute elasticity (YouTube video), thanks to our partitioning-based data architecture, to cater for that need.

Q6. Beyond query-level tuning, what are the most significant architectural challenges that impede query performance, such as handling I/O bottlenecks from large table scans, managing inefficient data access patterns caused by normalization choices, or addressing network latency in distributed database environments?

Kaan Kara : This is a great question and one of the core things that we deal with daily when optimizing the HeatWave query execution engine. For an efficient distributed analytics engine, optimizing for I/O bottlenecks (for HeatWave, this means primarily memory and network) is at the top of the priority list. HeatWave has many optimizations to reduce these bottlenecks. For instance, we utilize an efficient vectorized bloom-filter to reduce the amount of probe-side data that we need to shuffle around in our cluster when performing a distributed join.

Driven by our customer workloads, recently we have worked on a late-materialization feature. Our customers work with wide string columns frequently. To reduce frequent access to these, we perform a transformation in our logical plan: Any wide columns that are not needed are removed from leaf table scan nodes; instead, we project the primary keys. Later in the plan, we introduce additional joins utilizing these primary keys to gather the wide columns that the query needs to produce the result. This feature will improve performance for certain production queries which project many wide columns by a significant amount.
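A minimal sketch of the late-materialization transformation described above (invented data, not HeatWave's plan representation): the scan keeps only the primary key and the narrow filter column, and the wide strings are fetched by a key lookup only for the rows that survive the filter:

```python
# Base table: primary key -> (narrow numeric column, wide string column)
table = {
    1: (10, "x" * 1000),
    2: (25, "y" * 1000),
    3: (7,  "z" * 1000),
}

# Early phase: scan only the narrow column plus the primary key,
# so the wide strings never flow through the filter operator.
qualifying_pks = [pk for pk, (amount, _wide) in table.items() if amount > 9]

# Late materialization: join back on the primary key to fetch the wide
# columns only for rows that survived the filter.
result = [(pk, table[pk][1]) for pk in qualifying_pks]
print([pk for pk, _ in result])  # → [1, 2]
```

The saving is proportional to the selectivity of the filter times the width of the deferred columns: highly selective filters over wide strings benefit the most.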

Q7. Specifically, as of MySQL 9.3.0, it is possible to create temporary tables that are stored in the MySQL HeatWave Cluster. What are these tables used for?

Kaan Kara : Yes, our customers can now create temporary tables directly within HeatWave, as in-memory materialized tables. Previously, the only way to load a table into HeatWave was through loading an InnoDB table or loading an external table from object storage. But sometimes, users want to store the result of a query as a temporary materialization without going through the load path, which can be a bottleneck.

Q8. Are these tables similar to conventional database views?

Kaan Kara : They are very similar to materialized views, but temporary tables are static. So, changes in the base tables will not be propagated and temporary tables themselves cannot be changed. If the customer use case requires change propagation from base tables, then materialized views are the right approach, which will be supported soon in HeatWave.

Q9. Can you please explain how these MySQL HeatWave temporary tables help reduce query execution time?

Kaan Kara : Let me give an example: Consider an analyst investigating the transactions on a certain publicly traded stock. The queries will need to perform a join between “stocks” and “transactions” tables on some stock-id, followed by further aggregations (getting volume by date) or maybe further joins and ordering (sorting by largest buyers in each timeframe) etc. In this example, the initial join between “stocks” and “transactions” needs to be performed repeatedly and can be an expensive part of the queries. The analyst can now create a materialized temporary table based on the result of this join directly within HeatWave and it can be used later as much as needed by other operations.

Q10. Is calculating the load factor, i.e., a measure of how full a hash table is, really a good metric for estimating query execution times? Or are there other metrics that need to be taken into consideration?

Kaan Kara : By itself, it is a narrow metric, relevant only to the cost of a single join or aggregation. During our physical compilation, this metric contributes to our cost estimation indirectly: depending on a join's build-side cardinality or a group-by's output cardinality, we choose an appropriate hash table size. This size then dictates the runtime and memory cost of each operation. To estimate the query cost holistically, all relational operators, along with how much data will be moved around, are then considered.
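As a rough illustration of how a cardinality estimate feeds hash-table sizing and thence cost (a toy model; HeatWave's actual cost function is not public), one can pick the smallest power-of-two bucket count that keeps the load factor under a target, then derive memory cost from it. The `bytes_per_slot` figure and the 0.5 load-factor target are assumptions for the example:

```python
def choose_table_size(expected_rows, max_load_factor=0.5):
    """Smallest power-of-two bucket count keeping load factor <= target."""
    size = 1
    while expected_rows / size > max_load_factor:
        size *= 2
    return size

def hash_cost(expected_rows, bytes_per_slot=16, max_load_factor=0.5):
    """Toy cost model: table size drives both memory and probe cost."""
    size = choose_table_size(expected_rows, max_load_factor)
    return {
        "buckets": size,
        "load_factor": expected_rows / size,   # lower -> fewer collisions
        "memory_bytes": size * bytes_per_slot,  # higher -> more memory pressure
    }

print(hash_cost(1_000_000))
```

The tension the answer describes is visible here: a lower load-factor target reduces collision cost per probe but doubles the memory footprint at each step, and a holistic optimizer has to weigh that against the cost of every other operator and data movement in the plan.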

Q11. What is the next project you wish to work on?

Kaan Kara : My next projects are around automatic maintenance of materialized views within HeatWave. This entails automatic substitution and creation of materialized views. We are excited to share more soon.

………………………………………………………

Kaan Kara is a principal member of technical staff at Oracle, working as a lead developer mainly responsible for query execution in HeatWave MySQL.

As part of the HeatWave team, he has led multiple projects that substantially improved the performance and the memory efficiency of the query execution engine. A sample of the projects includes pipelined relational operator execution, bloom-filter-enhanced distributed joins, base relation compression, and late decompression optimizations. Collectively, these improvements led to multi-factor geomean speedups on analytical benchmarks such as TPC-H and TPC-DS, while reducing the memory requirements of the in-memory execution engine, enabling a single HeatWave node with 512GB of memory to run the full 1TB TPC-H benchmark.

More recently, he was the lead developer introducing the new VECTOR type to MySQL, along with highly optimized vector processing functions within HeatWave, laying the data layer foundation that enabled highly anticipated vector store features within HeatWave, such as semantic search and retrieval-augmented generation.

Prior to joining Oracle, Kaan received his doctoral degree in 2020 from ETH Zurich, Systems Group in Computer Science Department. His research focused on using reconfigurable hardware devices (FPGAs) to accelerate data analytics. He has published papers in top database venues such as VLDB and SIGMOD, showcasing the potential benefit of FPGA-based implementations for data partitioning and in-database machine learning tasks.

Resources

On HeatWave MySQL: Query Execution, Performance, Benchmarks, and Vector type. Q&A with Kaan Kara. ODBMS.ORG MARCH 4, 2025

…………………………………….

Follow us on X

Follow us on LinkedIn

Oct 10 25

Beyond the AI Hype: Guido van Rossum on Python’s Philosophy, Simplicity, and the Future of Programming.

by Roberto V. Zicari

” I am definitely not looking forward to an AI-driven future. I’m not worried about AI wanting to kill us all, but I see too many _people_ without ethics or morals getting enabled to do much more damage to society with less effort.”

Q1. The “Zen of Python” emphasizes simplicity and readability. As AI and machine learning systems become increasingly complex, do you believe these core principles are more important than ever, or do they need to be re-evaluated for this new era?

Guido van Rossum: Code still needs to be read and reviewed by humans, otherwise we risk losing control of our existence completely. And it looks like models are also actually happiest coding in languages like Python that have a "humanist" philosophy: since LLMs are good at handling human language structures, and programming languages are in the end intended for human use, it follows that (given some training) such languages are also easy for LLMs to read and write. And most LLMs have had great training in Python.

Q2. When you first created Python, did you ever envision it becoming the dominant language for scientific computing and artificial intelligence? What factors do you believe were most critical to its unexpected success in these fields?

Guido van Rossum: I had no idea! I was not ambitious at all (still am not). I do think that the critical factors to success were twofold. First, as a language, it’s super easy to understand, yet quite powerful. As Bruce Eckel observed, “it fits in your brain”. The second factor is that I designed it to support really good integration with OS services and third-party libraries. This made it versatile and extensible, e.g. by allowing major libraries like NumPy to be developed basically independently from Python itself.

Q3. With the recent work on making the Global Interpreter Lock (GIL) optional and the general demand for performance in AI, what is your perspective on the future of parallelism and concurrency in Python? How crucial is this for the language’s longevity?

Guido van Rossum: I honestly think the importance of the GIL removal project has been overstated. It serves the needs of the largest users (e.g. Meta) while complicating things for potential contributors to the CPython code base (proving that new code does not introduce concurrency bugs is hard). And we regularly see questions from people who try to parallelize their code and get a slowdown, which makes me think that the programming model is not generally well understood. So I worry that Python's getting too corporate, because the big corporate users can pay for new features only they need (to be clear, they don't give us money to implement their features, but they give us developers, which comes down to the same thing).

Q4. You were a key advocate for introducing type hints into Python. How do you see static typing evolving within the language, and what role do you think it plays in building the large-scale, mission-critical AI applications we see today?

Guido van Rossum: I don’t know of any large-scale mission-critical AI applications, but I know of plenty of large-scale mission-critical non-AI applications and for those it’s essential to have type hints — otherwise no other tools can do much with your code base. I’d say the cut-off for using type hints is at about 10,000 lines of code — below that, it’s of diminishing value, since a developer can keep enough of it in their head, and traditional dynamic tests do a good-enough job. But once you reach 10,000 it’s hard to maintain code quality without type hints. I wouldn’t foist them upon beginners with the language though.

Q5. The transition from Python 2 to 3 was a significant, and at times challenging, chapter in the language’s history. What were the most important lessons from that experience that could inform future major evolutions of Python, especially as new paradigms emerge?

Guido van Rossum: I don’t know how paradigms would affect this (paradigm shifts effectively mean that past experience doesn’t help understand the new reality), but the key lesson is that for any future transitions (even 3.x to 3.x+1) we must always consider how we can support old applications without requiring them to change. Basically the approach to migration must be carefully considered, especially since most libraries have to support a range of versions (something that we didn’t sufficiently appreciate with 2-to-3, and for which we had no good solution planned).

Q6. Python’s simplicity is one of its most celebrated features. As new, powerful libraries for AI add layers of abstraction and complexity, what do you think is the best way for the community to keep the language approachable and prevent it from becoming overwhelming for beginners?

Guido van Rossum: So far the AI libraries I’ve used are not particularly powerful or complex — they just give people a way to talk to a server that can perform some magic. It’s no different than figuring out how to use some of the more complex internet protocols. Maybe the main difference is that AI providers are in such a hurry that they change their APIs every three weeks and provide horrible, chaotic documentation. 🙂 In the end we will do what we’ve always done — the world of software is built on libraries and APIs.

Python has survived many dramatic changes in computing unscathed (in the early ’90s the Internet barely existed, and e.g. Microsoft was distributing software on floppy disks and CD-ROMs — we made it through the development of the Internet and the World-Wide Web, from centralized computers to PCs to software running in the browser, and through huge scaling improvements of hardware).

Q7. Given the specific demands of modern AI development—from data manipulation to model training—if you had the power to add one major feature or change to Python’s core today, what would it be and why?

Guido van Rossum: Nothing comes to mind. AI is over-hyped. It’s still software. In my own use of AI we make good use of it with the help of some small libraries that harness the power of AI to do useful things (notably human language understanding and generation) to data that we manipulate in quite traditional ways. Some of our code is written by a so-called “agent”. But we don’t use “vibe coding” — we stay in control where it comes to architecture and API design.

Q8. Newer languages like Mojo and Julia are being developed specifically for high-performance AI. How do you view this emerging competition, and what must Python do to maintain its leadership position and stay relevant for the next decade of technological advancement?

Guido van Rossum: Mojo is intended to *implement* high-performance AI "kernels", which is a very exacting piece of classic computer optimization. It has no chance of replacing Python's ecosystem — that's just not what they are interested in. I don't recall Julia being used for high-performance AI — it's used for high-performance numerical computation, which can serve AI just as well as it can serve other demanding application domains.

Q9. Your role has evolved from Benevolent Dictator for Life (BDFL) to a distinguished engineer at Microsoft. How has this transition influenced your perspective on Python’s development, its community governance, and its place within the larger corporate tech ecosystem?

Guido van Rossum: It’s clearly a demotion. 🙂 I was BDFL until it was no longer possible for a single person to take on all the responsibilities of Python governance. I retired from my day job. I ended up at Microsoft because I realized I wasn’t ready to stop coding, and after Google and Dropbox (and with the ghost of Ballmer thoroughly expurgated) it seemed a good place to try and have some more fun coding.

Q10. Looking back at your incredible journey with Python and looking forward to an AI-driven future, what do you hope the ultimate legacy of Python will be? And on a personal level, how do you envision the craft of programming itself changing in the coming years?

Guido van Rossum: I am definitely not looking forward to an AI-driven future. I’m not worried about AI wanting to kill us all, but I see too many _people_ without ethics or morals getting enabled to do much more damage to society with less effort. The roots for that abuse have been laid by social media, though — another major computer paradigm shift that changed society but didn’t really affect the nature of software.

I hope that Python’s legacy will reflect its spirit of grassroots and worldwide collaboration based on equity and respect rather than power and money, and of enabling “the little guy” to code up dream projects.

………………………..………………

Guido van Rossum is the creator of the Python programming language. 

He grew up in the Netherlands and studied at the University of Amsterdam, where he graduated with a Master’s Degree in Mathematics and Computer Science. 

His first job after college was as a programmer at CWI, where he worked on the ABC language, the Amoeba distributed operating system, and a variety of multimedia projects. During this time he created Python as a side project. He then moved to the United States to take a job at a non-profit research lab in Virginia, married a Texan, worked for several other startups, and moved to California. 

In 2005 he joined Google, where he obtained the rank of Senior Staff Engineer, and in 2013 he started working for Dropbox as a Principal Engineer. 

In October 2019 he retired. After a short retirement he joined Microsoft as Distinguished Engineer in 2020. Until 2018 he was Python’s BDFL (Benevolent Dictator For Life), and he is still deeply involved in the Python community. 

Guido and his family live in Silicon Valley, where they love hiking, biking and birding.

…………………………………….


Sep 11 25

On Debugging with AI. Interview with Mark Williamson

by Roberto V. Zicari

“Quality of code (and everything that goes along with it) isn’t talked about enough in AI conversations!  There are some obvious facets to this – does the code do what you intended?  Is it fast?  Does it crash in the common cases?”

Q1. Can AI write better code than humans?

Mark Williamson: I don’t think so, at least not today.  For one thing, LLM-based AIs are trained on pre-existing code, which was written by fallible humans.  So they at least have the potential to make all the mistakes we do.

Despite that, any coding AI you pick will write better frontend Javascript than me – that’s not my area of expertise.  But I would back an experienced human (with or without AI assistance) to beat an unsupervised AI coder.

Can they beat humans some day?  I assume so – but they’re not doing it today.  And when you factor in other aspects of the Software Engineer’s job (such as building the right thing) it’s even more challenging.

Q2. How do you define what is a “better” code?

Mark Williamson: Quality of code (and everything that goes along with it) isn’t talked about enough in AI conversations!  There are some obvious facets to this – does the code do what you intended?  Is it fast?  Does it crash in the common cases?

A lot of the work a human developer does to achieve this is actually achieved after the initial code is typed in.  There’s an iterative process of learning about and refining the solution – understanding what you’ve made and improving on it.  A lot of this is really debugging, in the broadest sense of the term: the code doesn’t do what you expected and you need to understand and fix it.

There’s another step beyond that, though – whether the code fits its intended purpose.  Getting that fit requires understanding the end user, thinking through the implementation tradeoffs and anticipating future developments.  For now, I see AI as freeing up some time so we can create space for those human insights.

Just focusing on how many lines of code we create is a pattern in the industry – we overvalue simply generating code versus all the other things that software engineers actually do.

Q3. Can AI write some types of code faster and with fewer simple errors?

Mark Williamson: Yes!

In my experience, I’ve found AI to be extremely useful in three scenarios:

  • Writing code that is almost boilerplate – where it’s not a copy-paste problem but requires quite routine changes.
  • Writing code that would be boilerplate for a different engineer – e.g. if I want to write JSON serialisation / deserialisation code in Python it’s easier for me to get an AI assistant to show me the shape of a good solution.
  • Doing refactors that involve restructuring or applying a small fix in a lot of places – a coding agent can handle the detail while I concentrate on the overall shape.

In all these cases, the benefit is in reducing the amount of thinking required to figure out my design approach.  In Daniel Kahneman’s book Thinking Fast and Slow, he describes two modes of thought: System 1 and System 2.  System 1 is the stuff you can just answer automatically, whereas System 2 thought requires effort.

System 2 is tiring – you probably can’t manage more than a couple of hours of really hard thinking about code in a day.  So it’s precious.  An agent lets me offload some work so I can focus that effort on exploring solutions to the real problem I’m trying to solve.

Q4. Large Language Model (LLM)-based AI code assistants are powerful tools, but they have significant limitations that developers must understand. What are such limitations?

Mark Williamson: The most obvious limitation is that they don’t know everything.  They often act as though they do, which is a trap.  “Hallucinations” are the most well-known consequence of this – in which the LLM gives an answer that is confident but ultimately not based in fact.

I like to say that modern AI’s training teaches it what a good answer looks like – they’ve seen lots of examples of them, after all.  So, from an AI’s point of view, a good answer includes attributes like:

  • Projecting confidence.
  • Using the right terminology.
  • Relating suggestions specifically to your question and context.
  • Being right!

If it can satisfy most of those, then it’ll think it’s done a good job.  So when they’re asked a question and they lack facts, an AI will figure “3 out of 4 isn’t bad” and give a dangerously convincing answer that’s not based in reality.

There are two important things we can do to reduce this risk:

  • Supply high-quality context to the underlying model – the more relevant information available the better.  Supplying insufficient information invites the model to guess and supplying irrelevant information encourages it to head off on the wrong track.
  • Verify the model’s answers against a ground truth – run your tests, have experts review your code, verify the dynamic behaviour of the application matches what you expected.

You want to focus the model’s intelligence on solving the real problem (not on guessing), then know when it has actually solved it.

Q5. While LLM-based code assistants are incredibly powerful, there is critical information they lack that limits their effectiveness and makes human oversight essential. Why this?

What does it mean in practice?

Mark Williamson: As a CTO, I’ll divide my answer into two parts:

  • As an engineer, LLMs don’t know enough about your code to solve all the problems you wish they could solve.  They typically don’t have good knowledge of the runtime behaviour of the system, which makes incorrect answers more likely.  And they’re not good at inferring design intent, making it harder to fix subtle bugs correctly.
  • As a product manager, LLMs lack the insight into the true purpose of the software to be built.  You cannot rely on them to design the code to the needs of the end users, long term evolution / maintenance and business tradeoffs required.

Q6. LLMs are brilliant at static analysis—interpreting the text of a codebase, logs, and other documents. But they are blind to dynamic behavior. This is the critical information they lack and cannot get. Why? Do you have a solution for this problem?

Mark Williamson: Coding agents have a similar weakness to humans: they can’t see what the program really did at runtime and it’s hard to reason about why things happened.  They can get some of this from logs (and LLMs are really good at reading logs!) but logging can only capture so much.

There’s a catch-22 here for the developer: if you’d been able to predict precisely what logging you’d need to fix the bug you’re investigating, then you’d have known enough to avoid the bug in the first place.  There’s no reason to think that’s different for LLMs.

Coding agents can follow the same tedious loop that humans do: adding more logging to a codebase and running stuff again (or perhaps asking a human to obtain more logs some other way).

They can even do this toil more enthusiastically than any human! But the speed you gained from the agent may just disappear into a swamp of rebuilding, attempting to reproduce, finding what logging statements are still missing and then repeating the process.  This kind of inefficiency will be bad news for any Engineering department hoping to improve productivity in return for their AI spend.

Q7. It seems that time travel debugging (TTD) directly addresses this limitation. Please tell us more.

Mark Williamson: Time travel debugging captures a trace of everything a program does during execution.  The resulting recordings effectively represent the whole state of memory at every machine instruction the program executed.

Anything you want to know about the program’s runtime behaviour can then be queried from the recording, without needing to re-run or change the code.  Rare bugs become fully reproducible and any state can be explored in detail.  Moreover, the ability to rewind time makes it easy to explore why a bad state arose, not just what the state was.

Of course, storing all of memory at every point in execution time would be extremely inefficient!  A modern, scalable time travel debugger stores only information that flows into the program (initial memory state, IO from disk and network, system calls results, non-deterministic CPU instructions, etc).  This makes it possible to efficiently recompute all other state on demand.  Watch the talk “How do Time Travel Debuggers Work?” for the full details on how a modern time travel debugger is built.  
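The record-only-the-inputs idea can be sketched in a few lines of Python (a toy model; real time travel debuggers like Undo's work at the machine-instruction level with binary instrumentation). Only the non-deterministic values flowing into the program are captured on a "tape"; everything else is recomputed by deterministic re-execution:

```python
import random

def record(program, rng):
    """Run the program live, taping only its non-deterministic inputs."""
    tape = []
    def source():
        v = rng.random()   # stand-in for IO, syscalls, timing, etc.
        tape.append(v)
        return v
    return program(source), tape

def replay(program, tape):
    """Deterministically re-execute the program from the recorded inputs."""
    it = iter(tape)
    return program(lambda: next(it))

def demo(source):
    # Example "program": all state derives from the three inputs it reads.
    total = 0.0
    for _ in range(3):
        total += source()
    return total

live_result, tape = record(demo, random.Random(42))
assert replay(demo, tape) == live_result  # any execution point is reproducible
print(len(tape), "recorded inputs")
```

Because replay is deterministic, the debugger can recompute the full state at any instruction on demand, which is what makes "rewinding" cheap relative to storing every memory snapshot.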

For an AI, this capability is ideal.  Remember that we need high-quality context to feed the model and a ground truth to make sure its answers are based in reality.  With time travel debugging, a coding agent has access to a recording of the program’s dynamic state and can drill down in detail on any suspicious behaviours – that gives us high-quality context.  The ground truth comes from the deterministic nature of the recording and also makes it possible to verify the AI’s findings.

These properties mean that AI coding agents get smarter when given access to a time travel debugging system.

Q8. You have released an add-on extension called explain, which integrates with your UDB debugger (part of the Undo Suite). What is it and what is it useful for?

Mark Williamson: Good question. Let me explain first what Undo is to set the context. It’s our time travel debugging technology (which runs on Linux x86 and ARM64) and is mostly used to debug complex enterprise software that makes use of advanced multithreading techniques, shared memory, direct device accesses, etc.

The Undo Suite captures precise recordings of unmodified programs using just-in-time binary instrumentation. The two main components of the Undo Suite are:

  • LiveRecorder – which captures program executions into portable recording files.
  • UDB – which provides a GDB-compatible interface to debug both live processes and recordings (but also integrates into IDEs such as VS Code).

The explain extension is our first step in integrating AI with a time travel debugging system.  It provides two pieces of functionality:

  • An MCP (Model Context Protocol) server – this exports the functionality of our UDB debugger for use by an AI agent, allowing it to integrate into existing AI workflows including agentic IDEs (such as VS Code with Copilot, Cursor or Windsurf).
  • The explain command itself, which provides additional tight integration with terminal-based coding agents (such as Claude Code, Amp and Codex CLI) where available.

In either case, we’re providing the power of time travel debugging to an AI, so that it can reason about the dynamic behaviour of a program.  As the name suggests, this extension has a particular focus on explaining program behaviour – how a given state arose, why the program crashed, etc.

We provide a carefully-designed set of tools to the agent so that it can answer these questions effectively. It’s important that the design of the MCP tools guides the actions to be taken by the LLM, otherwise it can easily get overwhelmed by the complexity.

In an agentic IDE you can connect to the MCP server in a running UDB session – then ask the agent questions (use the /explain prompt exported by the server for best results).  In UDB itself, you can just type the explain command and we’ll automatically invoke your preferred terminal coding agent and put it to work on your problem.

Q9.  Can you show us an example of how time traveling with an AI code assistant works in practice?

Mark Williamson: Sure! I’d recommend watching these two demo videos:

  1. The cache_calculate demo video on the Undo website which showcases how to use explain to get AI to tell you what has gone wrong in the program.
  2. This YouTube video where I use AI + time travel debugging to explore the codebase of the legendary Doom game and understand exactly what the program did when I played it.

We have additional demos, showcasing more advanced functionality, which aren’t yet public – you can book a personalised demo from https://undo.io/products/undo-ai/ to see the more advanced AI debugging functionality we’re currently building.

Qx. Anything else you wish to add?

Mark Williamson: The core message here is that AI-Augmented Software Engineers still need the right tools to do their jobs well.  Our goal is to make AI coding agents more effective at understanding and fixing complex code, improving the return on investment Engineering departments get on their AI stack.

The next big step for us will be designing a UX to be used by AIs instead of by humans.  Providing time travel debugging to a coding agent is already useful, but to get the best performance we need to work with what LLMs are good at.  In other words:

  • A query-like interface: rather than the statefulness of a debugger, LLMs are happiest when they can ask Big Questions and get a report in answer.  Our engine lets us extract detailed information very quickly from a recording so that an AI can start with an overview, then drill down.
  • Specialised, composable tools: a debugger provides quite general tools (stepping, breakpoints, etc) for a human developer to apply to any problem.  Coding agents can use these but we believe LLM intelligence is best spent on solving the core problem well, rather than diluting it on planning complex tool use.  A specialised set of analyses will allow the LLM to focus on what it’s good at – finding patterns and proposing fixes.

On top of these tools and the data contained within our recordings, we are building Undo AI – a product to enable agentic debugging at enterprise scale.  We’re currently taking applications for our pilot program; please get in touch to find out more at undo.io.

……………………………………………

Mark Williamson, Chief Technical Officer, Undo

After a few years as our Chief Software Architect, Mark is now acting as Undo’s CTO. Mark loves developing new technology and getting it to people who can benefit. He is a specialist in kernel-level and low-level Linux and embedded development, with wide experience in cross-disciplinary engineering.

In his previous role, his remit was to align the product’s architecture with the company’s needs, provide technical and design leadership, and lead internal quality work. One of his proudest achievements is his quest towards an all-green test suite!

As Undo’s CTO, Mark’s primary responsibility is to scale product-market fit and ensure we take our products in the right direction to meet the needs of a broader spectrum of customers.

Mark is also author on Medium, a conference speaker, and a new home owner enjoying the delights of emergency home repairs!

………………………..

Follow us on X

Follow us on LinkedIn

Aug 1 25

On Enterprise AI. Interview with Stephen Kallianos

by Roberto V. Zicari

“I wish more organizations realized how fundamental it is to lay a good foundation for any enterprise AI initiative. That foundation includes a robust data strategy and a unified architecture.”

Q1. What are the responsibilities of a Field CTO?

Stephen Kallianos:
As a Field CTO, my core responsibility is to serve as a trusted advisor for enterprise customers, especially at the Senior Architect and C-level. It’s a consulting role, focused on establishing credibility and helping customers and prospects connect their strategic priorities to SingleStore’s unique value proposition. 

The work involves identifying and clearly communicating the best-fit enterprise architectures, leveraging deep expertise in data and AI infrastructure. My role requires me to thoroughly understand customer challenges, align our technical solutions to those needs, and recommend the most effective solutions. 

In addition, I lead the presales function here at SingleStore: running technical discovery, developing tailored demonstrations and proofs of value, qualifying opportunities, and shaping value-based engagements that bridge the gap between technology and business results. Ultimately, my goal is to ensure that our solutions deliver both technical and business impact – setting organizations up for long-term success in their modernization efforts.

Q2. What’s something you often hear from customers and prospects?

Stephen Kallianos: Organizations are hungry for applications that can leverage the most recent data for AI-driven insights, but often get bogged down managing separate systems for transactional and analytics workloads — leading to increased complexity and database sprawl. I regularly hear concerns about inconsistent query performance, missed SLAs for real-time or batch data, and the growing need for flexible deployment options — be it cloud, on-prem, or hybrid. Most notably, there’s a surge in organizations looking to modernize: they want to drive real business outcomes by reducing operational overhead, simplifying their technology stacks, and future-proofing their data infrastructure to keep pace with rapidly evolving AI requirements and new digital experiences.

Q3. What are the most commonly shared pain points among customers seeking to implement enterprise AI?

Stephen Kallianos: Customers implementing enterprise AI encounter a few pervasive pain points. Common issues include: navigating data silos and complex integrations, struggling to perform large-scale aggregations efficiently, and dealing with the high costs and poor performance that come with scaling legacy data infrastructure. Meeting real-time data requirements for AI workloads is a particular challenge, especially when data resides in multiple, disparate databases. Legacy architectures often fail to deliver the query performance and SLAs necessary for AI use cases, leading to a pressing need to modernize and consolidate systems.

Q4. Do you see GenAI being used in the enterprise? How? 

Stephen Kallianos: Absolutely. Enterprises are rapidly adopting generative AI (GenAI). They’re integrating large language models (LLMs) into their AI architectures for a range of scenarios from analytics and customer support to productivity tools and operations. We’re seeing production deployments in areas like enterprise search (retrieving contextually relevant records and documents), AI-powered personal assistants and co-pilots, workflow automation, developer productivity tools (text-to-SQL, code recommendations), and even advanced analytics for fraud detection and data enrichment.

Q5. What do you wish more organizations knew when it comes to adopting enterprise AI?

Stephen Kallianos: I wish more organizations realized how fundamental it is to lay a good foundation for any enterprise AI initiative. That foundation includes a robust data strategy and a unified architecture.

Relying on siloed or hastily patched-together systems makes it almost impossible to achieve the simplicity, security, or scale needed for AI to succeed in production. The best results come from adopting a single platform that handles analytics, machine learning, and operational workloads — streamlining architecture and lowering risk. Ultimately, AI projects succeed when technical outcomes are tightly aligned to clear business value — not when technology is adopted for its own sake.

Q6. How can organizations better align their technical solutions with the organizational goals?

Stephen Kallianos: In my experience, the best way organizations can align technical solutions with their business goals is for the organization and the database vendor to hold a workshop to clarify and align on both the desired business outcomes and technical requirements. Mutually qualifying opportunities up front — ensuring there’s clarity and genuine need — helps avoid wasted effort later. Together, you can frame what success looks like and define concrete criteria, creating a North Star for architecture, implementation, and measuring results. Having hands-on proof-of-value phases (using real data and involving cross-functional teams) is key to validating that proposed solutions actually deliver the anticipated outcomes, and is an approach I use extensively when leading presales and customer workshops.

Q7. SingleStore recently unveiled a new version of the database, and it contains a lot of upgrades. In your opinion, which 2-3 things are most valuable to customers? Why?

Stephen Kallianos: For me, several features from the latest SingleStore release stand out as particularly remarkable. These include the major upgrades we made to Flow, Iceberg, Aura, and our developer experience.

The first is about ingest and data integration. With SingleStore Flow (our no-code solution for data migration and continuous change data capture) now natively embedded in our Helios managed service, customers can orchestrate data movement into SingleStore directly within the cloud platform, making the process far more streamlined. Data ingestion is now much simpler and more flexible, and moving data from heterogeneous sources like Snowflake, Postgres, SQL Server, Oracle, and MySQL is easier than ever. 

This is part of our overall “SingleConnect” experience that allows customers to incorporate more and richer data sources into SingleStore. Adding Flow into SingleStore Helios® further strengthens our ability to integrate from diverse environments, reducing integration friction and enabling real-time analytics and AI use cases without the pain of traditional ETL complexities.

We’ve also done a lot to enhance our Apache Iceberg ecosystem. For customers using data lakehouses with Apache Iceberg, there’s now a speed layer that offers high-performance, low-latency data interaction on top of Iceberg-managed storage. Improved bi-directional integration allows for easier, faster data exchange with external Iceberg tables, so real-time applications can finally tap into lakehouse architectures with the latency and interactivity they require.

In the area of AI and serverless compute, we upgraded our Aura Container Service. Aura brings together vector search, analytics, function-as-a-service, and GPU-accelerated workloads in a single containerized environment. Already optimized for running AI/ML and containerized workloads, Aura now offers support for cloud functions (lambda-style serverless functions). This unlocks the ability to build data APIs, agents, and inference endpoints for embeddings or other ML tasks, all within a managed, scalable environment. Coupled with performance enhancements like multi-value indexes for JSON, automatic query re-optimization, and improved cross-workspace branching and disaster recovery, these upgrades drive higher reliability and enterprise scalability.

We’re always thinking about the people doing the work, so we made substantial improvements to the developer experience, with enhancements to our AI-powered query builder assistant (SQrL), deeper integrations with GitHub, notebook scheduling/versioning, better pipeline and billing visibility, and a more powerful multi-tab SQL editor. All of these improvements make building, monitoring, and scaling AI and data applications faster and more seamless.

Collectively, these advances eliminate bottlenecks, simplify integration, and provide the speed, flexibility, and full-lifecycle support today’s enterprise AI and analytics apps demand.

Q8. Which is more prominent, on prem or cloud? With security and privacy being big concerns these days, are people talking about returning on-prem?

Stephen Kallianos: Our strategic advantage is that we are truly hybrid — with the most versatile offering across SaaS (Helios), Bring Your Own Cloud (BYOC), and self-managed solutions. We provide maximum flexibility for customers to deploy anywhere, on any cloud, allowing them to meet their specific business, technical, and regulatory needs. We’re seeing continued momentum around our managed cloud service, Helios—driven by a desire for operational simplicity, scalability, and innovation. But our BYOC and self-managed (private cloud) solutions are also going strong. This flexibility means customers can mix and match approaches: leveraging Helios for fully managed simplicity, BYOC for deployment control, or self-managed options for maximum security and privacy. Ultimately, we empower customers to modernize on their terms, run workloads wherever they need, and never have to compromise on control, compliance, or agility.

Q9. What’s next for the industry, and what is SingleStore doing to help meet those needs?

Stephen Kallianos: The next evolution in our industry is all about convergence — bringing together analytical, transactional, and AI workloads on unified data platforms. Today, less than 1% of enterprise data is being used for enterprise AI, so the opportunity is immense. There’s a heightened focus on delivering real-time intelligence, integrating AI natively, and eliminating data silos, alongside surging demand for seamless integration with data lakes, warehouses, and GenAI/LLM platforms. SingleStore is innovating aggressively by expanding serverless compute, adding integrated AI and ML functions, launching AI co-pilots, enabling direct LLM integration, and introducing the Aura platform that I mentioned earlier. These advances are designed to enable customers to build the next generation of data-driven and AI-powered applications — unlocking more value from their data and making enterprise AI real for the business.

………………………………………

Stephen Kallianos, Americas Field CTO, SingleStore.

With deep expertise in data-driven strategies and cloud-based innovation, Stephen Kallianos is the Americas Field CTO at SingleStore. In this role, he combines his SingleStore expertise and industry knowledge to drive a collaborative approach towards helping customers align solutions with business goals.

………………………..


Jul 11 25

On Trading Analytics. Interview with Cat Turley

by Roberto V. Zicari

“Trades are driven by real-time market conditions where billions of dollars move every second, generating enormous amounts of data. The biggest challenge is minimizing the latency associated with analyzing these chaotic data streams and turning it into something that’s actionable for traders.”

Q1. What is your role at ExeQution Analytics?

Cat Turley: I’m the CEO and founder of ExeQution Analytics. We’re a boutique consultancy focused on helping financial organizations, particularly trading firms, take greater advantage of their data infrastructure. The story of ExeQution Analytics began 20 years ago, when I was working at an international broker. I challenged the head of trading to “do more” as I believed that we could write more interesting analytics and achieve better understanding of the markets and our trading patterns. I truly believed we were only skimming the surface of what kdb+ could achieve. He returned the challenge and invited me to build a green-field analytics platform capable of understanding market microstructure and providing real-time and historical signals to electronic trading strategies. 

Over the past two decades, I’ve continued to refine this approach to analytics. Four years ago, we officially launched ExeQution Analytics as demand had grown, and we identified a gap in the market. There were plenty of resources focusing on the acquisition and storage of data, but less focus on what the data was used for. We developed a structured and flexible analytics framework to solve the problem that everyone was seeking to solve: how to make analytics more efficient and accessible across all aspects of the trading lifecycle. Now my role requires that I work closely with those we have partnered with, from financial organisations on both sides of the street, to technology leaders such as KX. 

Q2. How do you help organizations maximize the value of their technology investments and improve data-driven innovation?

Cat Turley: What makes ExeQution Analytics unique is that we’re positioned right at the intersection of the three pillars of trading: the traders themselves, quants and technology leaders. We speak all three languages and provide a framework that helps everyone achieve their common goal of delivering better trading outcomes. Our standardized framework efficiently analyzes large volumes of market data at speed and scale. From there we create customized analytic platforms that enable firms to gain enhanced and actionable insights tailored to their unique trading workflows. 

Trading teams cannot accelerate innovation if they’re stuck spending all of their time preparing data. We’re giving them the tools necessary to remove the onus of data preparation and instead focus on extracting signals, identifying patterns, and understanding market activity. When armed with these insights, firms can test more ideas, move faster, ask better questions about their data, and ultimately generate strategies that improve trading outcomes.

Q3. Let’s talk about Quant. What do they do and how have they evolved? 

Cat Turley: Quant teams build and refine models that power trading strategies, everything from price prediction to portfolio optimization. This has always been a data-driven process, but over the years, thanks to the advancement and accessibility of computational tools, it has increased in complexity and sophistication. Now, quants can efficiently and quickly analyze years of historical market data to glean unique insights that optimize market prediction and trade execution. 

Most financial organisations have been using advanced machine learning capabilities for the last decade or so, enabling more sophisticated predictions. There is potential for even further evolution as advances in AI become more integrated into the quant trading process through the use of large language models, vector databases and techniques such as time series similarity search.

The second significant avenue of evolution is the integration of real time market data into the quant lifecycle, enabling better understanding of how models react to the ever-evolving market conditions. As data volumes grow, it has never been so important to remain agile in volatile market conditions. 

Q4. If we consider Intra Trade Monitoring: What are the challenges?

Cat Turley: Trades are driven by real-time market conditions where billions of dollars move every second, generating enormous amounts of data. The biggest challenge is minimizing the latency associated with analyzing these chaotic data streams and turning it into something that’s actionable for traders.  Trading once relied heavily on human intuition and experience with many decisions based on “gut feeling”, but with the advances in markets and technology, this instinct can now be augmented with data-driven understanding. The challenge is getting the right analytics in front of the right person at the right time, so they can make the best decision to improve trading outcomes. These days, traders are typically monitoring thousands of individual orders at any one time, as algorithms control execution. They need access to tools that can distill all this noise into actionable insight.  

Q5. How is Trading Analytics related to Quant and Intra Trade Monitoring? What do you see as major challenges here?

Cat Turley: Intra trade monitoring supports real-time observation and analysis of trade execution, feeding live data into analytics systems. Both traders and quant analysts depend on these analytics to measure performance feedback for refining computational models that drive pricing, forecasting, and trade decision-making. Essentially, these three components support pre- and post-trade analysis. 

The challenge many firms are facing is how to use trading analytics to transform TCA from a tick-the-box exercise into a more comprehensive framework for properly understanding the nuances and intricacies of trade execution and the opportunities for optimisation. Historically, pre-trade and post-trade were often considered separate processes. To truly optimise execution, they should be considered two aspects of the same process, where one informs the other. 

This is where KX can offer an advantage: one of its unique attributes is that it’s a high-performance analytics database optimized for both real-time and historical data. Building a custom trading analytics platform using KX technology allows organisations to evolve towards more proactive analytics, enabling the identification of both optimisation opportunities as well as execution risks and alpha generation opportunities. Integrating real time data into the TCA/execution analysis or trading research process enables a better understanding of where a different trading decision would have resulted in an improved outcome, and back-testing with historical data can inform where this intuition offers statistically significant performance improvement. 

Q6. You have been using KX for over 20 years now. How did the KX ecosystem evolve over time?

Cat Turley: It’s changed dramatically. When I started over 20 years ago, obtaining a kdb+ license required a much more substantial investment in development resources, often turning into a one-to-two-year project before it was put into production.  Now, thanks to KX-driven platform releases and updates as well as tools like Data Intellect’s TorQ, the development of kdb+ infrastructure has been streamlined. Today, teams can take advantage of previous iterations and go to market much faster. It’s gone from a highly technical, custom-built process to something far more streamlined, accessible, and scalable. That means firms can focus on the nuance of their individual trading requirements, how to turn data into value, rather than spending as long on the building blocks of data ingest, storage and availability. 

Q7. You use kdb+ for trading. What are the main benefits you see in using such a database?

Cat Turley: The ExeQution Analytics framework is developed in q and designed for integration with kdb+ platforms. So, you could say that I’m a big advocate for q and kdb+, and that’s not just because of the speed: the incredible flexibility that the q query language offers is truly the standout benefit. kdb+ is the fastest time-series database available, and q enables us to move beyond a data warehouse to deliver genuine analytic platforms with a reduced time to market. That speed allows quants to pursue excellence – they can fail fast, learn fast, and keep improving their models. 

As I briefly mentioned earlier, kdb+ is unique because it can handle both real-time and historical data without compromising speed and performance. In the trading world, this combined benefit is what allows for better trading outcomes, rather than a series of missed opportunities. 

Q8. Specifically, how do you handle large volumes of real-time and historical data with low latency? Why does kdb+ being columnar matter?

Cat Turley: Our approach combines fast in-memory processing for live data with efficient on-disk, columnar storage for historical data, enabling seamless and high-speed time-series analytics across both. And kdb+’s columnar layout absolutely makes a difference in optimizing low-latency performance, because queries pull only the columns they need. 

Storing data in columns allows for faster reads, better compression, efficient CPU caching and parallel processing, all of which are ideal for fast-moving, analytical trading workloads. 
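The effect of reading only the needed columns can be illustrated with a small sketch. This is a toy illustration, not kdb+ itself: it contrasts a row-major record layout with a per-column layout (the on-disk scheme kdb+ uses, where each column is a contiguous file) and counts the bytes a price-only query would have to touch in each case.

```python
import numpy as np

# Hypothetical tick table stored two ways: row-major records versus
# one contiguous array per column (as a columnar store keeps it on disk).
n = 1_000_000
rows = np.zeros(
    n,
    dtype=[("time", "i8"), ("sym", "i4"), ("price", "f8"), ("size", "i8")],
)
cols = {name: rows[name].copy() for name in rows.dtype.names}

# A query that touches only `price` scans 8 bytes per row in the
# columnar layout, versus the full 28-byte interleaved record in the
# row layout -- a 3.5x reduction in bytes read before any work is done.
row_bytes = rows.nbytes           # every field travels together
col_bytes = cols["price"].nbytes  # only the needed column is read
print(row_bytes // n, col_bytes // n)  # bytes touched per row: 28 vs 8
```

The same ratio carries over to disk I/O, compression (similar values compress better when stored together), and CPU cache behaviour, which is why the gap widens rather than narrows at scale.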

Q9. kdb+ offers Q, a SQL-like language. How easy is it to use, and how do you encourage adoption over SQL?

Cat Turley:  I am a huge advocate for the q language – it is elegant, expressive and enables incredibly fast time-to-market from a development perspective. An analytic written in q versus SQL or Python is typically 10 times more concise. This means it takes 10% of the time to write, your development team can be 10% of the size, and you incur 10% of the errors. It may have a reputation for being harder to learn but it is well worth getting over the initial learning curve, as it offers huge benefits once mastered, especially when dealing with time-series operations, high-frequency data and real-time decision making. 

The other benefit is that the q community is unlike any other; you can really lean on these folks for support and learning materials. SQL gives you access to data and is well-suited for general purpose tasks and projects, but q is purpose-built for speed and analytics at scale which are critical benefits for those working with high-speed or high-volume data.

Q10. kdb+ is used across hedge funds, investment banks, and trading firms. What are the similarities and differences among them when dealing with quantitative trading operations?

Cat Turley:  Their trading operations are similar in the sense that they are all working with high data volumes and require low latency. Of course, they have varying levels of latency tolerance and data flow, but overall, I would say the premise is the same: operations need to optimize data-intensive workloads and minimize time-to-decision. 

What I think differs is their objectives. Hedge funds prioritize strategy simulation and alpha generation, banks emphasize client service and pricing, and trading firms are laser focused on speed and execution edge. Regardless of the objective, all rely on the ability to process massive volumes of real-time and historical data with precision and speed.

………………………………………………………………

Cat Turley, CEO/Founder, ExeQution Analytics 

With 20 years’ experience working with leading global investment banks and some of the world’s largest asset managers, Cat has an extensive understanding of market microstructures, execution analysis and how the right choice of technology can empower organisations to achieve more with less. Cat is passionate about improving efficiency and understanding across all areas of trading and founded ExeQution Analytics to contribute towards this goal.

Related Posts

On Trading Tech and Quant Development. Interview with Jad Sarmo, ODBMS Industry Watch, May 29, 2025

………………………..


May 29 25

On Trading Tech and Quant Development. Interview with Jad Sarmo

by Roberto V. Zicari

“Forecasting financial time series is one of the most complex tasks in data science.”

Q1. You’ve been working in the Trading Tech and Quant Development space for the last 20+ years. What are the main lessons you’ve learned through this experience? 

Jad Sarmo: Back in 2004, I deployed the first automated trading system (ATS) for foreign exchange at a top-tier bank. We had to build software directly on traders’ workstations to send algorithmic orders—latency was measured in hundreds of milliseconds. 

Since then, the landscape has evolved dramatically: the proliferation of low-latency submarine fiber-optic cables, high-frequency signals bouncing off the ionosphere, the emergence of cloud computing, AI-assisted development, the rise of blockchain, and nanosecond-level FPGAs. 

Despite this, the core principles remain unchanged: a solid grasp of systems and markets, clear business objectives, and the ability to assemble the right experts to solve the right problems. Equally important—especially as firms face increasing external scrutiny and apply for new licences—is a commitment to compliance with applicable laws and regulations from the outset.

A personal lesson I’ve come to value is this: if you’re comfortable, it’s time to take a risk, learn, and repeat. That cycle is essential in such a fast-evolving landscape.

Q2. What is your role at B2C2?

Jad Sarmo: B2C2 is a global leader in the institutional trading of digital assets, serving institutions such as retail brokers, exchanges, banks, and fund managers. We provide clients and the market with deep, reliable pricing across all market conditions. 

I joined B2C2 in 2021—during a pivotal year for digital assets—to build our global Quantitative Development desk. My team works closely with traders, researchers, and engineers to improve client pricing, trading strategies, and automated risk systems.

I also lead our Exchange Squad, which manages trading from market data ingestion to algorithm optimization across more than 30 AWS regions globally.

Q3. What are the main challenges in this industry when it comes to data management? Specifically, since you are handling liquid assets, what is the main challenge you’ve seen when an asset can be “easily” converted into cash in a short amount of time? 

Jad Sarmo: Like in any asset class—FX, equities, rates—crypto trading involves massive volumes of market and trading data. But crypto adds a unique layer of complexity.

It’s a 24/7 market, with both on-chain (blockchain-logged) and off-chain (centralized exchanges) activity. A significant share of volume also flows through DeFi protocols using smart contracts.

We face challenges like inconsistent exchange APIs (REST, WebSocket, etc.), cloud-native environments, and the need for extremely low-latency systems that handle massive data bursts. Meanwhile, newer or illiquid tokens present formatting hurdles, with decimals occasionally extending to 10+ digits — far beyond what many traditional systems were designed to handle.
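One way to see why 10+ fractional digits break float-based pipelines is with Python’s `decimal` module. This is a generic sketch of the problem, not B2C2’s implementation; the token quantity is invented for illustration.

```python
from decimal import Decimal, getcontext

getcontext().prec = 38  # ample headroom for tokens with 10+ fractional digits

# Hypothetical raw quantity string from an exchange API for an
# illiquid token quoted to 12 decimal places.
raw_qty = "0.000000000123"

as_float = float(raw_qty)      # binary float: tiny rounding error on parse
as_decimal = Decimal(raw_qty)  # exact decimal representation

# Aggregating a million such fills is where the float pipeline drifts;
# the decimal pipeline stays exact at every step.
float_total = sum([as_float] * 1_000_000)
decimal_total = sum([as_decimal] * 1_000_000, Decimal(0))
print(decimal_total == Decimal("0.000123"))  # True: the decimal sum is exact
```

Fixed-point integers (storing quantities in the token’s smallest unit) achieve the same exactness with better performance, which is the usual normalization choice when a system must handle many assets with differing precisions.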

Real-time hydration and normalization of incoming data streams are therefore critical to support both research and trading effectively.

Q4. You mentioned in a previous presentation that managing a “Crypto ecosystem” is not an easy task. What is a Crypto ecosystem, and what is it useful for? What are the specific challenges you face, and how do you solve them?

Jad Sarmo: By “crypto ecosystem,” I mean the global, interconnected infrastructure where digital assets are traded: exchanges, OTC counterparties, and all supporting systems.

Each participant may be located in a different region – Virginia, Tokyo, London, and beyond. Our system ingests high-frequency data from across the world, unifies it, and processes it with both low latency and high throughput.

The hardest part is normalizing inconsistent feeds so they’re useful across trading and research. Historically, AWS prioritized reliability over low latency, but in recent years, the biggest players—including B2C2—have worked closely with AWS to re-architect the cloud to meet the latency needs of crypto trading.

Q5. Let’s talk about the use of AI and Machine Learning in the financial services industry. You cannot predict the market by training an AI model on historical data, because things change rapidly in the financial markets. How do you handle this issue? Does it make sense to use AI?

Jad Sarmo: Forecasting financial time series is one of the most complex tasks in data science.

A picture of a dog from 10 years ago is still useful to train an image classifier—but financial data ages fast. Market structure, participants, and behaviour shift constantly, so models need regular recalibration.

Ensemble learning is particularly powerful in finance; rather than relying on a single predictive model, we combine many models that each perform slightly better than average. AI is not a crystal ball, but it provides meaningful signals that enhance traditional pricing and risk systems.
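The intuition behind combining many slightly-better-than-average models can be shown with a toy simulation. This is a generic illustration of the ensemble effect, not B2C2’s models: each weak “model” independently predicts the sign of the next move with 55% accuracy, and a majority vote over 25 of them is right far more often than any single one.

```python
import random

random.seed(7)  # deterministic toy run

# Hypothetical weak model: predicts the true sign correctly 55% of the
# time, independently of the other models.
def weak_prediction(truth: int, accuracy: float = 0.55) -> int:
    return truth if random.random() < accuracy else -truth

def majority_vote(preds: list) -> int:
    return 1 if sum(preds) > 0 else -1

n_trials, n_models = 10_000, 25
single_hits = ensemble_hits = 0
for _ in range(n_trials):
    truth = random.choice([-1, 1])
    preds = [weak_prediction(truth) for _ in range(n_models)]
    single_hits += preds[0] == truth              # one weak model alone
    ensemble_hits += majority_vote(preds) == truth  # vote of all 25

print(single_hits / n_trials, ensemble_hits / n_trials)
```

The gain comes entirely from the independence assumption: real financial models share data and regimes, so their errors correlate and the improvement is smaller – which is why model diversity matters as much as individual accuracy.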

Q6. You have been leveraging a vector-native data platform at B2C2. Could you please explain what you do with such a data platform?

Jad Sarmo: We use KX’s kdb+ platform to support our real-time and historical time-series data needs. It enables global ingestion across AWS regions, persistent storage, replay of massive tick datasets, and complex event processing.

The consistency of this platform means researchers can focus on analysis without worrying about where the data lives. PyKX, a Python–Q hybrid notebook interface, allows heavy computations to run in Q, while using Python for exploratory analysis and ML.

KX also provides high-performance dashboards for quick data visualization—even by non-technical users.

Q7. Why not use a classical relational database or a key-value data store instead?

Jad Sarmo: Traditional relational databases are too rigid and slow for high-frequency time-series analytics. Key-value stores are great for quick lookups but lack native analytics support.

Vector-native platforms like kdb+ are designed for exactly this use case. They let us run complex queries over billions of rows in milliseconds—without reshaping the data or creating indexes.

As data volume grows to terabytes per day, traditional databases become engineering bottlenecks. In contrast, vector platforms scale naturally, with each column and date efficiently mapped to files.

Q8. Let’s go a bit deeper. If you start with “FeedHandlers,” how do you end up processing this complex data at scale, in real time and without losing some data?

Jad Sarmo: Our architecture begins with Java or Rust feed handlers that convert raw exchange data into kdb+ format.

A ticker plant then routes data to three layers:

  1. A real-time in-memory database
  2. A persistent on-disk database
  3. A complex event processor

This setup ensures we can act on data instantly, store it reliably, and support deep analytics—all with complete transparency for end users, whether they’re consuming live or historical data.
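The fan-out described above can be sketched in a few lines. This is a hypothetical, simplified model of the pattern (in a real kdb+ deployment these would be separate q processes subscribing to the ticker plant), with an invented wide-spread rule standing in for complex event processing.

```python
from collections import deque

class TickerPlant:
    """Toy ticker plant routing each normalized tick to three layers."""

    def __init__(self) -> None:
        self.rtdb = {}                        # 1) real-time in-memory database
        self.log = []                         # 2) stand-in for on-disk persistence
        self.cep_window = deque(maxlen=100)   # 3) complex event processor window

    def publish(self, tick: dict) -> None:
        self.rtdb[tick["sym"]] = tick   # latest snapshot for live queries
        self.log.append(tick)           # durable record for replay/research
        self.cep_window.append(tick)    # rolling window for event rules
        if tick["ask"] - tick["bid"] > 0.01:  # toy CEP rule: wide-spread alert
            print("alert:", tick["sym"])

tp = TickerPlant()
tp.publish({"sym": "BTCUSD", "bid": 64000.0, "ask": 64000.5})
# prints: alert: BTCUSD  (spread 0.5 exceeds the toy threshold)
```

The key property the sketch preserves is that all three consumers see the same tick stream, so live views, historical replays, and event rules can never disagree about what happened.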

Q9. What about data quality? How do you ensure data quality in the various phases of data processing?

Jad Sarmo: Data quality starts with ingestion. Exchange feeds vary in reliability and format, so we normalize and hydrate the data immediately to remove inconsistencies.

We maintain constant feedback loops between research and production teams to monitor and improve quality. Clean, consistent data is the backbone of everything—without it, even the most sophisticated models won’t perform.

Q10. You decided to integrate AWS FSx for Lustre with kdb+. What are the main benefits of this design choice?

Jad Sarmo: AWS FSx for Lustre has been a major improvement. It offers virtually unlimited horizontal scaling and high-speed access. We can connect dozens or hundreds of nodes, each with fast local disk and compute, to form a massive high-performance network file system.

It compresses files efficiently, offloading that work from kdb+. We can spin up isolated research environments on demand without affecting production, and there’s no downtime. Auto-scaling lets us right-size our infrastructure at any time.

Compare that to traditional datacentres—provisioning takes weeks and usually leads to overbuying hardware. In the cloud, it’s a five-minute job.

Q11. How is industry regulation affecting this complex data management?

Jad Sarmo: Regulation is advancing quickly. This means we must store data in auditable, retrievable formats. End-to-end traceability—from ingestion to storage to downstream consumption—is non-negotiable.

This adds operational overhead, but it also emphasizes the need for trustworthy systems that meet both performance and compliance standards. We see this reflected in regulatory initiatives like the EU’s MiCA regulation, the approval of Bitcoin ETFs in the U.S., and the UK’s FCA Discussion Paper DP25/1, which explores regulating crypto asset activities.

……………………………………………………………

Jad Sarmo, Head of Quantitative Development | Expert in High-Performance Trading Systems, B2C2.

Jad Sarmo is a technology and trading infrastructure leader with over 20 years of experience building high-performance trading systems for FX and digital asset markets. He is currently Head of Quantitative Development at B2C2, a global leader in institutional liquidity for digital assets, where he oversees a global team delivering real-time pricing, exchange trading, and analytics infrastructure across 24/7 markets.

Prior to B2C2, Jad ran Technology at Dsquare Trading, a high-frequency proprietary FX trading firm that rose to prominence through cutting-edge algorithms, low-latency engineering, and a world-class team. There, he designed ultra-fast trading systems and led cross-functional teams through years of continuous innovation in a high-stakes environment.

Jad specialises in bridging the gap between trading, quant research, and engineering — turning complex ideas into reliable, automated, and profitable systems. His expertise spans real-time architecture, algorithmic trading, market data, and risk management, with deep technical fluency in Java, Python, KDB+/q, and AWS.

Based in London, Jad is dedicated to designing robust systems under real-world constraints and mentoring the next generation of technologists.
