
Technical Architecture Focus: Scaling Pandas to Petabytes: The Architecture and Tradeoffs of BigQuery DataFrames. Interview with Ivan Santa Maria Filho

by Roberto V. Zicari on March 7, 2026

Q1. You mentioned that BigFrames represents an interesting case study in “how a large company like Google can use OSS without really using OSS in the codebase.” Can you unpack this paradox?

Specifically:

  • BigFrames provides a pandas API, but the actual execution happens in BigQuery’s SQL engine via transpilation through intermediate representations (Ibis, SQLGlot). What are the fundamental architectural tradeoffs you face when creating an API-compatible layer versus actually forking and extending the original codebase?
  • From a legal/IP perspective, what considerations drive Google’s decision to reimplement APIs rather than wrap or extend existing OSS libraries? Is this purely about licensing, or are there technical benefits to the “clean room implementation” approach?
  • When you inevitably discover that certain pandas operations can’t be efficiently mapped to BigQuery SQL primitives, how do you decide between: (a) dropping that operation from your API surface, (b) implementing workarounds that might surprise users with different performance characteristics, or (c) extending BigQuery itself to support the operation natively?

Ivan Santa Maria Filho: Over the past 6 years I’ve been either leading or owning large data warehouse products. That includes Microsoft Cosmos Analytics and Azure Data Lake Analytics, and more recently leading a group in Google BigQuery called “BeyondSQL”. All three of those products are widely used by data scientists across the industry and represent more than 20 years of innovation. Cosmos Analytics and Azure Data Lake Analytics have their own programming language, and BigQuery is SQL centered.

Both approaches have their merits and limitations. While a dedicated, proprietary language allowed us to innovate at Microsoft and build an amazing product, I believe that learning a proprietary programming language is not as interesting in 2026 as it was in 2008. People change jobs more often, and quite honestly Python seems to be the winner for data scientists. SQL, while widely used and familiar, does not have the best control flow and error handling semantics. BigQuery in general continues to advance SQL with extensions like BQML, but is also betting on Python and notebooks.

I believe Python won because it is fun to use, and quite honestly easier than a lot of other languages. It is growing in complexity, but I can see how a duck-typed, interpreted language would be more attractive to someone coming from an environment like Matlab, and leveraging a wide, awesome ecosystem of freely available libraries. My take is that the Python community did an exceptional job making it a very rich ecosystem, and got several large companies to contribute. I am looking forward to all performance improvements coming down their development pipeline.

Our strategy for features, just like the product itself, is to respect where our customers are. Data scientists like Python and notebooks, so they get Python and notebooks. Because data frames are a popular data abstraction, they get BigFrames.

We tried to keep exactly the same semantics, such as implicit ordering. By default “head(5)” has “top(5)” semantics in BigFrames, which is a costly thing to do if the underlying data is a 1PB table without an index. If the user wants performance, though, they can choose to relax the ordering semantics and get results faster and cheaper.
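As a sketch of what this choice looks like from the user's side, assuming the current `bigframes.pandas` options surface (option names may differ across versions, and the table name here is hypothetical):

```python
import bigframes.pandas as bpd

# Strict (pandas-like) ordering is the default: repeated head() calls
# return the same rows, at the cost of an ORDER BY on the server side.
# Opting into partial ordering relaxes that guarantee for speed and cost.
bpd.options.bigquery.ordering_mode = "partial"

df = bpd.read_gbq("my-project.my_dataset.big_table")  # hypothetical table
df.head(5)  # no longer guaranteed to be a deterministic "top 5"
```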

The architecture choice considerations were all technical. Our first implementation relied heavily on Ibis, and we love it, but we are now writing our own compiler layer. We want to make the BigFrames package smaller, and add BigQuery specific features without polluting Ibis with vendor specific details. We will continue to contribute to Ibis and in many cases they remain the right choice for developers.

BigFrames does not use any proprietary APIs; anyone could write something like it. But we work where we work, and we made specific choices that only make sense for BigQuery. For instance, we use the BigQuery Storage read/write streaming operations instead of running a “select *” query. We also implemented a client side smart cache that supports several predicate push-down techniques that are not general at all. We would love to see people extending BigFrames to other storage systems and data warehouses, but right now we are focused on BigQuery.

My team also developed support for managed Python functions in BigQuery. Those allow users to package almost anything from the Python ecosystem into a lambda / Cloud Run style function that can be “applied” to a data frame or series. For instance, the user can write a sophisticated image transformation function in sklearn, deploy it as a user defined function, and “.apply()” that function to a multimodal column in BigQuery. They can call Hugging Face from the user function too, or even host a lightweight model in Cloud Run. We take care of deployment, garbage collection, billing, and more, and they get to use anything from the OSS ecosystem when they wish.

As you point out, we found APIs that were hard to implement on top of BigQuery. We want to cover them all, but we prioritize by crawling public git projects and notebooks and sorting the functions by the most used, and by listening to our customers.

BigFrames has averaged two releases per month, and sometimes we go in directions we were not expecting because our customers asked for them, like implementing more visualization compatibility. We were expecting users to do data preparation for AI training, and data exploration was a bit of a surprise. BigFrames went from “not good” to “pretty good” in that space over the last year.


Q2.  BigFrames claims support for 150+ pandas functions, which is impressive but still a fraction of pandas’ full API surface. What are the hardest categories of pandas operations to support at BigQuery scale?

More specifically:

  • Stateful operations: Pandas allows arbitrary Python code with mutable state across operations. How do you handle operations that fundamentally assume in-memory, row-by-row iteration when your execution model is distributed SQL?
  • Ordering semantics: BigQuery DataFrames 2.0 introduced “partial ordering” mode as an optimization. Can you explain the exact semantic differences between pandas’ strict ordering guarantees and BigFrames’ partial ordering? Under what conditions does this difference become user-visible, and how do you help data scientists understand when they can safely relax ordering for performance?
  • Lazy evaluation boundaries: Pandas is eagerly evaluated; BigFrames builds a query plan. When a user calls df.head() or to_pandas(), you materialize results. How do you manage the impedance mismatch where users expect immediate feedback but you’re optimizing for deferred execution? Have you seen cases where this lazy evaluation confused users or led to unexpected costs?

Ivan Santa Maria Filho: We currently cover 850 of the approximately 1,400 Pandas functions, depending on whether you count all the supported parameter types or not. 

Making ordering flexible is a very common design compromise for frameworks trying to make Pandas scale. For BigFrames we decided to let users choose the behavior they prefer. They can choose Pandas semantics with strict (consistent) ordering of rows, where calling an operator like “head()” multiple times will yield the same results every time; this requires the equivalent of an ORDER BY clause. This is expensive, and for complex indices it requires us to compute a column. If the user does not care about the ordering semantics, they can set a flag and BigFrames will avoid the ORDER BY operation. We also log warnings for all APIs that rely on implicit ordering and, of course, allow the user to suppress the warning.

In some cases the user will be able to see a computed column with the complex index, which can cause compatibility issues. If the user explicitly names the columns they want, they see nothing. If they do not, they see any computed column we add. 

The lazy evaluation is another interesting compromise. BigQuery runs on top of really big clusters, with tens of thousands of servers each. It is designed to run complex queries, and has an advanced optimizer. The reason we do lazy evaluation is that all Pandas APIs are transformed into an abstract syntax tree, and the actual operations are pending execution. A BigFrames data frame is a “promise” of a data frame: a name, and a pending log of operations. When we execute the operations, they are all combined by the optimizer. We might detect that a later filter would remove rows from an earlier operation and filter first.
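The "promise plus pending op log" idea can be sketched in a few lines of plain Python. This is purely illustrative, not BigFrames internals: operations accumulate in a list, and a trivial "optimizer" moves filters ahead of projections before anything runs (safe here because the toy filter only touches surviving columns):

```python
# Toy sketch of a deferred data frame: a name plus a pending log of
# operations, combined by a tiny "optimizer" only when results are needed.
class LazyFrame:
    def __init__(self, rows):
        self.rows = rows          # stands in for the server-side table
        self.ops = []             # pending log of operations

    def select(self, *cols):
        self.ops.append(("select", cols))
        return self

    def filter(self, pred):
        self.ops.append(("filter", pred))
        return self

    def execute(self):
        # "Optimizer": push filters ahead of projections so fewer rows
        # flow through the later operations.
        ops = sorted(self.ops, key=lambda op: 0 if op[0] == "filter" else 1)
        rows = self.rows
        for kind, arg in ops:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

rows = [{"id": 1, "x": 10}, {"id": 2, "x": 99}]
lf = LazyFrame(rows).select("id", "x").filter(lambda r: r["x"] > 50)
print(lf.execute())  # [{'id': 2, 'x': 99}]
```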

Map-reduce systems have always dealt with choices like “should we sort the data then hash it for a join, or should we hash, join then shuffle sort?”. By using lazy execution we give ourselves a chance to use the optimizations and save the user money and time. Depending on how the user is paying for BigQuery, the amount of scanned data matters for cost and we are, again, 100% focused on customers. The first version of BigFrames we shipped was too expensive, and today we are on par with SQL.

When it comes to stateful operations, we support them in two ways. The data frames in BigFrames are more of a promise of a data frame than an actual data frame. When reading data from BigQuery the data frame contains a reference to a server-side snapshot of the table. When writing to BigQuery the append operations are kept local until enough changes accumulate and we flush them to a temp table, or the user does an operation that triggers the flush. The data frame also contains a log of pending transformations. The user can call execute() on the data frame and BigFrames will apply the transformations locally if possible, or fetch the results, which causes a global optimization of pending transformations and a server call. The server call might be a direct storage operation (read/write) or a SQL job.
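The "appends kept local until enough changes accumulate" behavior is a classic write buffer. A minimal sketch of the pattern (illustrative only; the threshold and flush triggers here are made up):

```python
class BufferedWriter:
    """Toy model of local appends that flush to a temp table in batches."""
    def __init__(self, flush_threshold=3):
        self.pending = []         # appends kept client-side
        self.temp_table = []      # stands in for the server-side temp table
        self.flush_threshold = flush_threshold
        self.flush_count = 0      # each flush models one server call

    def append(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.pending:
            self.temp_table.extend(self.pending)  # one batched server call
            self.pending.clear()
            self.flush_count += 1

w = BufferedWriter(flush_threshold=3)
for i in range(7):
    w.append(i)
w.flush()  # e.g. triggered by execute() or a read
print(w.flush_count, w.temp_table)  # 3 [0, 1, 2, 3, 4, 5, 6]
```

Seven appends become three server calls instead of seven, which is the whole point of buffering the writes.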

We also support Python UDFs, and those can retain state themselves. When the user performs an “apply(function)” operation, the function might be a remote function, which supports full web applications as backend, or a Python Managed function. The user can, for instance, create a remote function that connects to Hugging Face, download a transformer, cache it offline, and expose an API call to BigQuery. We will only initialize the web application when we launch it or add new instances of it, but every call to the UDF will benefit from the state of the server. 
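The load-once, call-many pattern described here can be sketched in plain Python. The "model" below is a hypothetical stand-in; in a real deployment the expensive setup would run when a Cloud Run instance starts, and every UDF call on that instance would reuse the cached state:

```python
# Illustrative stateful-UDF pattern: expensive setup runs once per
# instance; every subsequent call reuses the cached state.
_MODEL = None
LOAD_COUNT = 0

def _get_model():
    global _MODEL, LOAD_COUNT
    if _MODEL is None:
        LOAD_COUNT += 1                      # stands in for downloading weights
        _MODEL = lambda text: text.upper()   # hypothetical "model"
    return _MODEL

def udf_entry_point(text):
    """Called per row (or per batch); the model is loaded only once."""
    return _get_model()(text)

print([udf_entry_point(t) for t in ["a", "b", "c"]], LOAD_COUNT)
# ['A', 'B', 'C'] 1
```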


Q3. BigQuery’s UDF story has evolved from SQL/JavaScript UDFs that run in-process, to remote functions that call out to Cloud Functions, and now BigFrames 2.0 adds Python UDFs with a @udf decorator. Can you walk us through the architectural evolution and the limitations each approach addresses?

In particular:

  • Execution model tradeoffs: Running Python UDFs via Cloud Functions means network round-trips for every batch of rows. What’s the performance penalty in practice, and how do you amortize this cost through batching strategies? How large do result sets need to be before remote UDF overhead dominates total query time?
  • State management: Traditional UDFs can’t maintain state across invocations (by design, for parallelization). But data scientists often want to do things like “apply this pretrained ML model to every row” where loading the model once and reusing it would be far more efficient. How does BigFrames handle this? Can you cache model objects across UDF invocations, or does every batch reload from scratch?
  • Error handling and debugging: When a Python UDF crashes on row 4,782,391 of a 10-million-row table, how do data scientists debug this? What visibility do you provide into UDF execution, and how do you balance comprehensive logging with the cost/performance implications of collecting it at scale?
  • Security boundaries: Allowing arbitrary Python code to run is a massive security surface. How do you sandbox UDF execution to prevent: (a) accessing other customers’ data, (b) egress of sensitive data, (c) abuse of compute resources (crypto mining, etc.)?

Ivan Santa Maria Filho: I think it is important to say the UDFs are used by BigFrames, but users don’t need BigFrames to use them. They can declare and use them from SQL. We did not want to create a proprietary API for this, so we extended the public SQL API instead. This is a recurring theme for our team.

We expect the UDF space to evolve a lot in 2026 and 2027. BigQuery supports SQL UDFs, JavaScript UDFs, Remote Functions, and now Python managed UDFs. JS runs in a sandbox, which is itself inside a nested VM, running on the same set of machines as BigQuery workers. There is no network cost, but there are costs to launch the VM and inter-process costs too. For remote and managed UDFs we currently run them on Cloud Run, and we have the network costs. What we do for those is to batch rows to amortize costs, and we have invested a significant amount of time to make the serialization and deserialization costs low.

This might sound counter-intuitive, but the biggest performance problem is not the network. The biggest challenge for us is to teach the optimizer how long individual UDFs take to process a row, and how many parallel calls we should be making, with how many rows on each call. For our first iteration we will ask users to help us by setting core counts, RAM and concurrency level. We will give them telemetry and logging to let them make that call. Over time we want to watch the UDFs and adjust the settings automatically, but that will come later.
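The effect of batching on per-call overhead can be seen with a toy cost model. The numbers below are invented purely for illustration, not measured BigQuery figures:

```python
def total_cost(n_rows, batch_size, per_call_overhead_ms, per_row_ms):
    """Toy cost model: each remote call pays a fixed overhead, so larger
    batches amortize that overhead across more rows."""
    n_calls = -(-n_rows // batch_size)  # ceiling division
    return n_calls * per_call_overhead_ms + n_rows * per_row_ms

# 1M rows, 50 ms of network/serialization overhead per call, 0.01 ms/row
unbatched = total_cost(1_000_000, 1, 50, 0.01)        # one row per call
batched = total_cost(1_000_000, 10_000, 50, 0.01)     # 10k rows per call
print(int(unbatched), int(batched))  # 50010000 15000
```

With one row per call the fixed overhead dominates by orders of magnitude; batching makes the per-row work the dominant term, which is why the network is not the biggest problem once batching is in place.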

For your specific question, we support fairly complex UDFs. One of my first tests was to call Hugging Face from the UDF and set up a local pipeline (local to the UDF runtime, in Cloud Run). The UDF had two dozen Python functions defined: one to fetch my developer keys from our key service (KMS), another to take the key and download a text pipeline from Hugging Face, another to store the weights and set up a local cache, and so on. One of those Python functions was the UDF entry point.

When we instantiate the UDF, or auto-scale it by adding instances, we run the UDF body as if it were a main function in Python. I used that to set up the stateful model locally in the Cloud Run instance. When BigQuery calls the UDF, it calls the entry point function. You can find a similar example calling Google’s translation APIs: the client is instantiated only once.

We are considering a Python UDF version that runs in the shard like the JavaScript UDF, but it will depend on customer demand.

Error handling with data frames and Python is one of the advantages this approach has over SQL. If the user calls a function per data frame row, they can assign the return code to another data frame column, then later use a filter to retry only the failed rows. SQL in general would force the user to retry the query, which would run every row again. For example, say you want to send emails to customers matching certain criteria using UDFs and SQL, and assume that “SELECT send_email(customer_email) WHERE …” would select 10k users. If the send_email function fails for any of them, BigQuery would retry the entire job. The assumption of the SQL language is that send_email() has no side effects until the entire job is successful, which is very likely not true. This is a very easy way to spam customers. Using Python and “apply()”, the send_email UDF can return a fail/pass return code, and a simple while loop can retry only the failed rows using a filter. This is also doable in SQL, but it is hard enough that it makes for a good interview question.
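The retry-only-failures pattern can be sketched in plain Python, simulating `apply()` with a list of rows. `send_email` here is a stand-in that fails transiently at random; the key property is that delivered rows are never retried, so no customer is emailed twice:

```python
import random

random.seed(42)  # deterministic for the example

def send_email(address):
    """Stand-in for a side-effecting UDF that sometimes fails transiently."""
    return random.random() > 0.3   # True = delivered

rows = [{"email": f"user{i}@example.com", "sent": False} for i in range(10)]

# Retry loop: each pass re-applies the UDF only to rows whose status
# column says they failed, never to rows already delivered.
while any(not r["sent"] for r in rows):
    for r in rows:
        if not r["sent"]:
            r["sent"] = send_email(r["email"])

print(all(r["sent"] for r in rows))  # True
```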

Security is very important. Google enforces that all services and microservices have multiple security boundaries. For code running on the same machine as BigQuery processes, for example, user code runs in a sandbox, and the sandbox runs inside a gVisor VM. The gVisor VM has no IO stack and a very limited surface, and that is the public part of the solution. We have additional hardware, software, and network controls in place.

For managed Python you can safely assume we have at least the same mitigations in place, plus very robust monitoring, and we deploy the code to Cloud Run, which sits on another cluster using a restricted configuration. For functions running in Cloud Run it is possible to access the Internet, but the user has to specify a connection configuration, which includes a service account, grant that service account the correct permissions, and make sure the VPC settings in their project allow it. Only if the project is configured for Internet access, the UDF creator has the rights to create service accounts and connections, and they have permissions to access the Internet, is it possible to copy data outside Google. By default there is no Internet access, so the user has to do work to enable it.


Q4. You mentioned BigFrames would “certainly explain the limitations of BigQuery.” Let’s dig into that. What are the most significant BigQuery architectural decisions that constrain what BigFrames can do, and how do these manifest as surprising limitations for users?

For example:

  • Storage format constraints: BigQuery’s columnar storage and partitioning strategy presumably makes some pandas operations prohibitively expensive. What operations fall into this category? Are there pandas patterns that work fine on 10GB but break completely at 10TB due to BigQuery’s architecture?
  • Type system mismatches: Pandas supports Python’s dynamic typing; BigQuery has a strict schema. How do you handle cases where a pandas operation would dynamically change column types based on data content? Do you fail at query planning time, or try to infer schemas and potentially fail at execution time?
  • Result size limits: BigQuery DataFrames 2.0 changed allow_large_results to default to False, failing queries that return >10GB compressed data. This is a dramatic departure from pandas’ “it fits in RAM or it doesn’t” model. How do you help users understand when they’re bumping against this limit, and what patterns do you recommend for working around it (beyond just “set the flag to True”)?
  • Transaction semantics: Pandas DataFrames are just objects; mutations are immediate and in-memory. BigFrames operations compile to queries. What happens when users expect ACID transaction semantics (e.g., “update these 3 tables atomically”) but you’re generating separate SQL statements?

Ivan Santa Maria Filho: BigQuery is designed to support SQL, to scale to datasets with PBs of data, and to use highly optimized, controlled SQL engine operators. For what it was designed it works exceptionally well. When it comes to running arbitrary user code, I believe we could do much more.

Many choices get harder at scale. The simplest one to describe is supporting the implicit ordering of rows. If you have 1GB of data, dropping an index and computing a new one will take a couple of seconds. If you have 10TB that will take longer, maybe not linearly longer, but longer. There is no magical way to fix this problem.

We could pull a page from RDBMS and use a B-Tree and clustering keys as storage, but BigQuery reads data from multiple partitions in parallel, and the data would return in random order. We could use a single partition for data frames storage, but that would limit scale and performance. It would also force a table rebuild when the index changes. We could use B-Trees and secondary indices to simulate a table scan. We could inject sort operators over a computed index column. Every option consumes time and raises the cost to our users.

We are offering the Pandas semantics by default, so users are not surprised, but also a mode more similar to what Polars and databases do. If our customers tell us this is acceptable, we will make it the default; otherwise we will continue to look for the best way to gain scale with the Pandas semantics.

The type mismatches are always a problem. Python uses duck typing, but it also supports a very rich type system, with several Python libraries defining their own data types, both simple and complex. BigQuery is strongly typed, so we cannot just pass the bytes around; we have to convert from what is stored in the BigQuery cells to something that makes sense in Python. Those conversions can be expensive, particularly if the user is applying a UDF to a column or data frame. The data lives in BigQuery and is passed to the UDF row-wise or column-wise depending on the call syntax: BigQuery will partition the table holding the data frame data and send each partition to a worker, and that worker will read the data from our store and send it to the worker hosting the UDF. We do what we can to optimize this step, but that does not change the fact that the data in the store is in a different encoding than what Python expects. Even timestamps have different resolutions in BigQuery.
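One concrete example of the resolution mismatch: BigQuery's TIMESTAMP type is microsecond-resolution, while pandas timestamps default to nanoseconds, so nanosecond detail is silently lost on the round trip. The arithmetic is easy to show with plain integers (the timestamp value is an arbitrary example):

```python
# A hypothetical nanosecond timestamp from a pandas-side computation:
ns_since_epoch = 1_700_000_000_123_456_789

# BigQuery TIMESTAMP is microsecond-resolution, so storing the value
# and reading it back truncates the last three (nanosecond) digits.
us_since_epoch = ns_since_epoch // 1_000       # what BigQuery can hold
round_tripped_ns = us_since_epoch * 1_000      # what pandas gets back

print(ns_since_epoch - round_tripped_ns)  # 789 nanoseconds lost
```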

The result set size limit has a dual purpose. Certain operations have no inherent limits other than BigQuery’s own. Applying a UDF over rows scales well, and because of that the user might not even realize they are scanning hundreds of TBs of data. That can become really expensive, and the only billing surprise we like is when the price is lower than expected. The size limit is an attempt to avoid bad surprises.

The other purpose is to avoid crashing a notebook. If the user tries to render 10GB of data points in a notebook widget, odds are that will crash the notebook. One unique problem with very large datasets and series is that one cannot just plot every point. They also cannot just naively sample the data because they might miss a maximum, minimum, or anomalous data point. We are considering adding decimation algorithms to reduce the granularity of the series but retain its shape, maybe building that into BigFrames, but ideally contributing this to an OSS project.
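A minimal bucket-based decimation sketch: keep the minimum and maximum of each bucket, so extremes and anomalies survive the reduction, which naive sampling cannot guarantee. This is illustrative only; production decimation algorithms (largest-triangle-three-buckets, for example) also preserve point positions and overall shape more carefully:

```python
def decimate(points, n_buckets):
    """Reduce a series to at most 2 points per bucket, keeping each
    bucket's min and max so spikes are never dropped."""
    size = max(1, len(points) // n_buckets)
    out = []
    for i in range(0, len(points), size):
        bucket = points[i:i + size]
        lo, hi = min(bucket), max(bucket)
        out.extend([lo, hi] if lo != hi else [lo])
    return out

series = [0] * 1000
series[500] = 99   # a single anomalous spike
reduced = decimate(series, 10)
print(len(reduced), 99 in reduced)  # 11 True
```

A random sample of 11 points from this series would almost certainly miss the spike; the min/max rule keeps it by construction.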

As far as ACID semantics go, BigFrames does not support complex transaction boundaries. There is no way to express that changes to two data frames should both be committed or not committed. That said, for a single data frame BigFrames uses a “copy on mutate” approach, writing all changes to a new “backing table” and then linking the client object to the resulting table if everything goes right. We could investigate a way to have cross-data-frame transactions, but we never got that requirement.


Q5.  Looking forward, we’re seeing an explosion of “pandas-like” APIs: Dask, Modin, Polars, BigFrames, Snowpark Python, Databricks pandas API on Spark. Is the data science ecosystem converging toward pandas as a universal interface, or are we headed for fragmentation as each implementation adds vendor-specific extensions?

More philosophically:

  • API surface versioning: Pandas releases new versions regularly with API changes. How does BigFrames handle pandas version compatibility? Do you target a specific pandas version, or try to track the latest? What happens when pandas adds a feature you can’t support efficiently in BigQuery?
  • Beyond pandas: You mentioned that BigFrames 2.0 adds multimodal capabilities for unstructured data (images, text). Pandas wasn’t designed for this. At what point does extending the pandas API for new use cases become counterproductive, and you should just design a new API that’s purpose-built for distributed, multimodal data processing?
  • ML integration: BigFrames includes bigframes.ml with a scikit-learn-like API for BigQuery ML. But modern ML workflows involve PyTorch, TensorFlow, Hugging Face transformers, etc. How do you see the integration of these frameworks evolving? Will we see bigframes.torch or bigframes.transformers, or is there a fundamental mismatch between these frameworks’ execution models and BigQuery’s architecture?
  • Standards vs. ecosystems: Would the data science community benefit from a formal standard for “distributed dataframe APIs” (similar to how SQL standardized relational queries), or is the current Cambrian explosion of implementations actually healthy for innovation?

Ivan Santa Maria Filho: For API versioning, we follow the same model the OSS community does, with major and minor versions. We are expecting many large updates from Python and Pandas this year, and keeping up with the changes. 

My take is that the ecosystem will continue to fragment for a while, and that is not necessarily bad. We have enough innovation in this space that both clients and backends are evolving and have diverse feature sets. It is quite hard to offer a smooth, common surface across backends, without compromising performance and / or cost. By the time any industry gets to be fully standardized, that is usually the time it is also commoditized, and investment slows.

The BigQuery team added support for multi-modal data, auto-generation of embeddings, and auto-quantization of models, making extraction and inferencing way cheaper. Most data in enterprises everywhere is not structured. The amount of data stored in documents, intranet pages, email, calendars, and collaboration / chat tools is way higher than data curated in tables. 

I don’t see the point of hiding this functionality from customers, but I also don’t want to pollute the Pandas API namespace. We try to be as explicit as possible, so users know what is, and what is not a Pandas default API, but we make our extensions interoperable. 

For example, it is fairly easy to perform sentiment analysis on a support phone call audio recording, then join the sentiment and user data in BigQuery so a CRM application can track how happy the customer was, and what were the issues they cared about.

It is getting increasingly easy to instruct an agent to watch the general sentiment around a product and only warn us when something changes. 

The development around agents makes it harder to predict the future of Pandas-like frameworks. Given the current investment level, fragmentation is a natural evolution of this space, but if we achieve an agentic solution that produces results by answering questions in English, the mechanisms to handle data will be less popular.

The agents themselves will need a language to express what they want, but the number of direct active users might go down drastically. We might finally end up with something similar to the Star Trek Enterprise computer, and at that point I just don’t see a regular data scientist or business analyst writing Python directly. 

…………………………………………………………………………

Ivan Santa Maria Filho has a BSc and MSc in computer science and a wide variety of experiences as individual contributor and manager, having owned a small software company and worked on multiple billion dollar products and services at Microsoft, Meta and Google. His main areas of expertise include vertical integration of stateful, large scale services with ephemeral VM infrastructure, and the infrastructure itself.




Additional Context for ODBMS.org Readers:

What is BigFrames? BigQuery DataFrames (BigFrames) is an open-source Python library that provides a pandas-compatible API for analyzing data stored in BigQuery. Unlike pandas, which loads data into local memory, BigFrames translates operations into BigQuery SQL, enabling data scientists to work with terabyte-scale datasets using familiar pandas syntax.

Why does this matter? Most data scientists learn pandas, but pandas doesn’t scale beyond single-machine memory limits. BigFrames (and competitors like Databricks pandas API, Snowpark Python) represent a new generation of tools that preserve familiar APIs while transparently distributing computation. Understanding the tradeoffs in these systems helps organizations choose the right tools and helps researchers understand the limits of API compatibility.

Key Technical Innovation: BigFrames uses a transpilation approach: pandas operations → Ibis intermediate representation → SQLGlot SQL generation → BigQuery execution. This allows Google to avoid directly bundling pandas code while maintaining API compatibility – a fascinating case study in software architecture and licensing strategy.
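The transpilation idea — dataframe calls build an intermediate representation, and a compile step emits SQL — can be sketched in a few lines. This is a toy compiler for illustration only, not the actual Ibis/SQLGlot pipeline:

```python
# Toy pandas-to-SQL transpiler: dataframe calls append to an op list
# (the "intermediate representation"); compile() emits BigQuery-ish SQL.
class Frame:
    def __init__(self, table):
        self.table = table
        self.ops = []

    def __getitem__(self, cols):          # df[["a", "b"]] -> projection
        self.ops.append(("select", cols))
        return self

    def query(self, condition):           # df.query("...") -> filter
        self.ops.append(("where", condition))
        return self

    def compile(self):
        cols, where = "*", ""
        for kind, arg in self.ops:
            if kind == "select":
                cols = ", ".join(arg)
            else:
                where = f" WHERE {arg}"
        return f"SELECT {cols} FROM `{self.table}`{where}"

df = Frame("proj.dataset.orders")[["id", "total"]].query("total > 100")
print(df.compile())
# SELECT id, total FROM `proj.dataset.orders` WHERE total > 100
```

Nothing executes until `compile()` (or, in a real system, a materialization call) walks the op list, which is what lets an optimizer rearrange and combine operations first.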

……………………………..
