On High-dimensional computational geometry and databases. Q&A with Vissarion Fisikopoulos
Q1. MySQL’s spatial query optimizer must handle complex geometric predicates efficiently. Can you walk us through the key challenges in optimizing spatial queries—particularly for operations like spatial joins and nearest-neighbor searches? How do computational geometry techniques like R-trees, spatial indexing, and geometric filtering algorithms influence the optimizer’s cost model and execution strategy?
A. Spatial query optimization is fundamentally different from traditional optimization of non-spatial queries in relational databases. In the latter case, predicates are simple and cheap to evaluate, distributions can be approximated by histograms, and cost models rely on well-established cardinality estimates. Spatial workloads violate almost all these assumptions.
MySQL, and spatial DBMSs more broadly, must deal with geometric objects of varying size, arbitrary shapes, and non-uniform distributions in multidimensional space. As a result, spatial optimization is centered not around relational algebra techniques, but around computational geometry techniques that reduce the search space and avoid expensive geometric computations whenever possible. In MySQL's case, some of these principles are explored in the recent MySQL GIS engineering blogs, which demonstrate how spatial data can be transformed, compared, and interpreted across different reference systems and use cases within the database system.
A fundamental challenge in optimizing spatial queries is that spatial predicates are orders of magnitude more expensive computationally than scalar comparisons. Predicates such as ST_Intersects, ST_Contains, or nearest-neighbor distance searches may involve many geometric computations, each of which has complexity that typically scales linearly with the size of the input geometries. The implications for the query optimizer are profound: it must aggressively minimize the number of spatial predicate evaluations. This is where spatial indexing and filtering come into play.
Another challenge is that spatial distributions are often clustered, skewed, or anisotropic. Two spatial datasets can have large overlapping envelopes, dense pockets of points, and elongated or irregular geometries. This makes traditional one-dimensional statistics, such as histograms and value frequency counts, ineffective, and selectivity estimation for spatial predicates becomes inherently approximate. The optimizer must therefore rely on geometric heuristics and bounding-box statistics rather than traditional cardinality estimates.
A typical spatial operation in which these challenges arise is the spatial join. Spatial joins are one of the most important operations for combining spatial objects from several relations. Their efficient processing is critical because execution time is superlinear in the number of spatial objects in the participating relations, and that number may be very high. At the same time, relational join algorithms such as hash join and merge join cannot be applied efficiently to spatial joins. One approach to this problem is to exploit R-trees for the efficient processing of spatial joins. By recursively joining only pairs of tree nodes whose bounding rectangles intersect, the optimizer drastically reduces the number of candidate pairs and performs exact geometric tests only when necessary, achieving highly efficient spatial join processing.
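To make the node-pruning idea concrete, here is a minimal, self-contained Python sketch of a synchronized R-tree traversal. It is illustrative only, not MySQL's actual implementation: the `Node` class, the toy data, and the `exact_predicate` placeholder are assumptions, and the exact test stands in for a real geometry kernel.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def mbr_intersect(a: Rect, b: Rect) -> bool:
    """Cheap filter: do two bounding rectangles overlap?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)          # internal nodes
    objects: List[Tuple[int, Rect]] = field(default_factory=list)  # leaf entries (id, mbr)

    @property
    def is_leaf(self) -> bool:
        return not self.children

def rtree_join(n1: Node, n2: Node, exact_predicate, results):
    """Synchronized traversal: recurse only into node pairs whose MBRs intersect."""
    if not mbr_intersect(n1.mbr, n2.mbr):
        return  # prune this whole pair of subtrees
    if n1.is_leaf and n2.is_leaf:
        for id1, r1 in n1.objects:
            for id2, r2 in n2.objects:
                if mbr_intersect(r1, r2) and exact_predicate(id1, id2):
                    results.append((id1, id2))
    elif n1.is_leaf:
        for c2 in n2.children:
            rtree_join(n1, c2, exact_predicate, results)
    elif n2.is_leaf:
        for c1 in n1.children:
            rtree_join(c1, n2, exact_predicate, results)
    else:
        for c1 in n1.children:
            for c2 in n2.children:
                rtree_join(c1, c2, exact_predicate, results)

# Toy example: two small trees; the "exact" test is a stand-in for a real
# geometry kernel (e.g. an ST_Intersects implementation).
leaf_a = Node(mbr=(0, 0, 10, 10), objects=[(1, (1, 1, 2, 2)), (2, (8, 8, 9, 9))])
leaf_b = Node(mbr=(5, 5, 15, 15), objects=[(10, (8.5, 8.5, 9.5, 9.5)), (11, (14, 14, 15, 15))])
root_a = Node(mbr=leaf_a.mbr, children=[leaf_a])
root_b = Node(mbr=leaf_b.mbr, children=[leaf_b])

out: List[Tuple[int, int]] = []
rtree_join(root_a, root_b, lambda i, j: True, out)  # assume the exact test passes
print(out)  # only the pairs that survived MBR pruning, here [(2, 10)]
```

The essential point is in the first `if`: an entire pair of subtrees is discarded with four comparisons, and the expensive predicate is reached only by leaf entries whose rectangles already overlap.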
Another challenging task in spatial databases is optimizing nearest-neighbor (kNN) queries. Distance predicates do not align with the one-dimensional ordering assumptions of traditional B-tree indexes. Instead, the query optimizer must navigate multi-dimensional search spaces where distances grow in non-monotonic ways, making selectivity estimation difficult and highly dependent on the underlying data distribution. Efficient plans rely on spatial indexes such as R-trees, yet their effectiveness varies with data density, clustering, and the shape of bounding rectangles, which can cause large portions of the index to be explored unnecessarily. Furthermore, exact distance computations are expensive, especially for complex geometries, and even more so for geometries defined on the spherical or ellipsoidal surface used to model the Earth. As a result, the optimizer must carefully balance lower-bound pruning, incremental branch-and-bound search, and early termination to avoid excessive refinement and unnecessary computation.
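The branch-and-bound idea behind kNN search can also be sketched in a few lines. The snippet below is a simplified illustration, not a database implementation: it treats the index as a flat set of leaf pages summarized by bounding boxes, uses MINDIST as a lower bound, and terminates as soon as the best remaining lower bound cannot improve the current k-th distance. The data and the two-level structure are invented for the example.

```python
import heapq
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def mindist(q: Point, r: Rect) -> float:
    """Lower bound on the distance from q to anything inside rectangle r."""
    dx = max(r[0] - q[0], 0.0, q[0] - r[2])
    dy = max(r[1] - q[1], 0.0, q[1] - r[3])
    return math.hypot(dx, dy)

def knn(q: Point, leaves: Dict[Rect, List[Point]], k: int) -> List[Tuple[float, Point]]:
    """Best-first kNN: visit leaf pages in MINDIST order and stop as soon as the
    next lower bound already exceeds the current k-th best distance."""
    heap = [(mindist(q, r), r) for r in leaves]
    heapq.heapify(heap)
    best: List[Tuple[float, Point]] = []  # (distance, point), kept sorted and trimmed to k
    while heap:
        lb, r = heapq.heappop(heap)
        if len(best) == k and lb >= best[-1][0]:
            break                          # early termination: no remaining leaf can improve
        for p in leaves[r]:                # refinement step: exact distances
            d = math.dist(q, p)
            if len(best) < k or d < best[-1][0]:
                best.append((d, p))
                best.sort()
                best = best[:k]
    return best

# Toy index: three leaf pages, each summarized by its bounding box.
leaves = {
    (0.0, 0.0, 1.0, 1.0): [(0.2, 0.3), (0.9, 0.8)],
    (5.0, 5.0, 6.0, 6.0): [(5.5, 5.5)],
    (0.5, 2.0, 1.5, 3.0): [(1.0, 2.5)],
}
print(knn((0.0, 0.0), leaves, k=2))
```

In a real engine the heap would hold internal nodes as well as leaves, and the distance function could be spherical or ellipsoidal, but the pruning logic (lower bound versus current k-th distance) is exactly the balance described above.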
Geometric filtering algorithms, such as bounding-box tests and filter-refine cascades, operate independently of, yet often in combination with, R-trees, giving the optimizer inexpensive early checks that eliminate non-qualifying geometries before exact computations are invoked. Their selectivity directly shapes the cost model and execution strategy: effective filters make R-tree plans attractive and highly pruned, while weaker filters may push the optimizer toward alternative access paths or join orders to avoid excessive refinement work.
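The filter-refine pattern itself is tiny. The sketch below assumes the third-party shapely package for the exact predicate (in a real engine that role is played by the internal, robust geometry kernel); the polygons are made up for illustration.

```python
# Minimal filter-refine sketch; assumes the third-party "shapely" package.
from shapely.geometry import Polygon

def bbox_overlap(a, b):
    """FILTER step: a handful of comparisons on (minx, miny, maxx, maxy) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def intersecting_pairs(geoms_a, geoms_b):
    results = []
    for i, ga in enumerate(geoms_a):
        ba = ga.bounds                      # bounding box of the geometry
        for j, gb in enumerate(geoms_b):
            if not bbox_overlap(ba, gb.bounds):
                continue                    # cheap rejection, no exact test needed
            if ga.intersects(gb):           # REFINE step: exact (and expensive) predicate
                results.append((i, j))
    return results

squares = [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
           Polygon([(10, 10), (12, 10), (12, 12), (10, 12)])]
triangles = [Polygon([(1, 1), (3, 1), (2, 3)]),
             Polygon([(20, 20), (21, 20), (20, 21)])]
print(intersecting_pairs(squares, triangles))  # only the first square/triangle pair survives
```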
Last but not least, ensuring robustness in geometric operations is a fundamental challenge for spatial query optimization. Even small floating-point roundoff errors can lead to incorrect predicate evaluations, such as misclassifying intersections, distances, or containment, which in turn undermine both query correctness and the optimizer’s assumptions about selectivity and plan efficiency. If geometric kernels are not numerically stable, bounding-box filters may pass or reject candidates incorrectly, spatial joins may miss valid pairs, and nearest-neighbor searches may prune or retain branches improperly, leading to incomplete or overly costly results. To mitigate this, spatial databases rely on robust computational geometry algorithms and careful numeric handling so that the optimizer can trust geometric predicates and efficiently plan queries without compensating for unpredictable or inconsistent results.
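To illustrate the robustness issue, the following self-contained Python experiment, in the spirit of the classroom examples of Kettner et al., compares a naive double-precision 2D orientation test with an exact evaluation via Python's fractions module over tiny perturbations of a nearly collinear configuration. The specific points are illustrative; any disagreements the script reports are exactly the kind of misclassification that would mislead a predicate or an index.

```python
from fractions import Fraction

def orient2d_float(ax, ay, bx, by, cx, cy):
    """Naive double-precision orientation: >0 left turn, <0 right turn, 0 collinear."""
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)

def sign(x):
    return (x > 0) - (x < 0)

def orient2d_exact(ax, ay, bx, by, cx, cy):
    """Exact evaluation: the doubles are converted to rationals, so no roundoff occurs."""
    ax, ay, bx, by, cx, cy = map(Fraction, (ax, ay, bx, by, cx, cy))
    return sign((bx - ax) * (cy - ay) - (by - ay) * (cx - ax))

# Nearly collinear configuration, perturbed by single units in the last place
# (the double-precision ulp of 0.5 is 2**-53).
ulp = 2.0 ** -53
bx, by, cx, cy = 12.0, 12.0, 24.0, 24.0
disagreements = 0
for i in range(256):
    for j in range(256):
        ax, ay = 0.5 + i * ulp, 0.5 + j * ulp
        if sign(orient2d_float(ax, ay, bx, by, cx, cy)) != orient2d_exact(ax, ay, bx, by, cx, cy):
            disagreements += 1
print(f"floating-point and exact predicates disagree on {disagreements} of 65536 cases")
```

Production kernels avoid the cost of exact arithmetic on every call by using floating-point filters: the double-precision result is accepted when it is provably outside the roundoff error bound, and only the ambiguous cases fall back to exact evaluation.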
Q2. Traditional spatial databases typically handle 2D or 3D geometries, but many modern applications require higher-dimensional spaces (e.g., geospatial-temporal data, or feature vectors in ML pipelines). What are the fundamental challenges that arise when extending spatial query processing to higher dimensions, and are there lessons from your work on high-dimensional sampling and computational geometry that could inform next-generation spatial database architectures?
A. For moderate dimensions, around 5-10 depending on the application, the 2D or 3D algorithms and data structures, or their specializations, usually work. In higher dimensions, however, the curse of dimensionality sets in. This has consequences for the performance and memory efficiency of traditionally low-dimensional algorithms and data structures. In particular, indexes based on R-trees degenerate to a full index scan. For spatio-temporal data, trajectories, or ML feature vectors this shows up as heavy node overlap and huge candidate sets in range and nearest-neighbor search. At the same time, modern workloads introduce far more complex distance notions, combining spatial, temporal, and even semantic or learned similarity measures, which the traditional optimizer was never designed to model. These multidimensional and often non-Euclidean metrics break standard histogram-based selectivity estimation, leaving the optimizer without reliable statistics to decide when an index will prune effectively or when a scan is cheaper.
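A quick NumPy experiment (purely illustrative, with synthetic uniform data) makes this tangible: as the dimension grows, the nearest and farthest neighbors of a query become almost equally far away, which is precisely why bounding-box and distance-based pruning lose their power.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # database points per experiment

for d in (2, 10, 100, 1000):
    data = rng.random((n, d))          # points uniform in the unit cube
    query = rng.random(d)
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  nearest={dists.min():8.3f}  farthest={dists.max():8.3f}  "
          f"relative contrast={contrast:8.3f}")
# As d grows, the relative contrast collapses: almost everything is "roughly
# equally far" from the query, so there is little left for an index to cut.
```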
However, modern techniques have emerged that bypass these limitations entirely: approximate nearest-neighbor (ANN) libraries such as FAISS, HNSW, and ScaNN, along with vector databases like Pinecone, Weaviate, and Milvus, have shown that high-dimensional similarity search can be made practical at scale by abandoning exact pruning and embracing graph-based navigation, quantization, and learned embeddings. Their success in powering AI retrieval systems demonstrates that, in high dimensions, robustness and performance come not from traditional spatial optimizers but from data-driven, approximate structures explicitly designed for complex, learned distance functions.
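As a deliberately minimal illustration of that style of search, the sketch below builds a graph-based approximate index with the faiss library (this assumes the faiss-cpu or faiss-gpu package is installed; the dimension, dataset size, and HNSW parameter are arbitrary choices for the example).

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

d, n, k = 128, 20_000, 5
rng = np.random.default_rng(0)
xb = rng.random((n, d), dtype=np.float32)   # database vectors (e.g. ML embeddings)
xq = rng.random((10, d), dtype=np.float32)  # query vectors

index = faiss.IndexHNSWFlat(d, 32)   # HNSW graph with 32 links per node, L2 metric
index.add(xb)                        # build the navigable small-world graph
distances, ids = index.search(xq, k) # approximate k nearest neighbors per query
print(ids.shape)                     # (10, 5): ids of the 5 approximate neighbors per query
```

Nothing here resembles a traditional optimizer decision: there is no exact pruning guarantee, only a data-driven graph whose navigation quality is tuned empirically.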
I believe that work in high-dimensional sampling offers useful insights for future spatial database architectures. ANN libraries and vector databases succeed in AI workloads because they abandon exact pruning and instead rely on probabilistic, data-driven navigation of high-dimensional spaces. This mirrors what we see in high-dimensional sampling and volume computation: when exact geometric algorithms break down, techniques like Monte Carlo and geometric random walks make inference tractable. These connections suggest that future query optimizers may benefit from incorporating sampling-based cost estimation, geometric preconditioning, and probabilistic search strategies as core elements, drawing on principles that have already proven successful in high-dimensional machine-learning systems.
Q3. Your FOSDEM talk on the volesti project addresses efficient sampling from high-dimensional distributions—a problem central to statistics, optimization, and machine learning. Could you explain the geometric intuition behind random walk methods like Hit-and-Run or Hamiltonian Monte Carlo in convex bodies? How does the geometric structure of the feasible region impact sampling efficiency, and what are the practical scalability limits of these methods?
A. Sampling from high-dimensional distributions constrained to a convex body is a fundamental problem at the intersection of geometry, statistics, and optimization. A variety of random-walk methods can be used in this setting. In Hit-and-Run, the sampler repeatedly selects a random direction and moves to a point drawn from a suitable distribution along the chord defined by the intersection of that direction with the convex body. Geometrically, this allows the walk to make long moves that quickly reach distant regions of the body, leading to substantially faster convergence than more local walks such as the Ball walk or Grid walk. Hamiltonian Monte Carlo (HMC) takes a different approach: it simulates nonlinear trajectories governed by Hamiltonian dynamics. For constrained HMC, there are two main variants. In the Riemannian approach, the dynamics are defined so that the trajectory remains entirely within the convex body, typically a polytope, by construction. In the reflected approach, the trajectory evolves freely until it hits the boundary, at which point it reflects according to the tangent plane, producing a billiard-like path. This reflected method applies to any convex body for which the boundary’s tangent plane can be computed at a given boundary point.
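For readers who want to see the geometric idea in code, here is a minimal NumPy implementation of uniform Hit-and-Run over a polytope {x : Ax <= b}. It is an illustrative sketch under the stated assumptions (a bounded polytope and a strictly interior starting point), not the tuned implementation that the volesti project provides.

```python
import numpy as np

def hit_and_run(A, b, x0, n_samples, rng=None):
    """Uniform Hit-and-Run in the polytope {x : A @ x <= b}, starting from an
    interior point x0. Each step picks a random direction, computes the chord
    where the line through x meets the polytope, and jumps to a uniform point
    on that chord."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float).copy()
    samples = []
    for _ in range(n_samples):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)                  # random unit direction
        Ad = A @ d
        slack = b - A @ x                       # strictly positive at an interior point
        # x + t*d stays feasible while t * Ad_i <= slack_i for every facet i.
        with np.errstate(divide="ignore"):
            bounds = slack / Ad
        t_max = bounds[Ad > 0].min()            # nearest facet ahead
        t_min = bounds[Ad < 0].max()            # nearest facet behind
        x = x + rng.uniform(t_min, t_max) * d   # uniform point on the chord
        samples.append(x.copy())
    return np.array(samples)

# Example: the hypercube [-1, 1]^10 written as {x : A x <= b}.
dim = 10
A = np.vstack([np.eye(dim), -np.eye(dim)])
b = np.ones(2 * dim)
pts = hit_and_run(A, b, x0=np.zeros(dim), n_samples=5000, rng=np.random.default_rng(1))
print(pts.mean(axis=0).round(2))  # should be close to 0 in every coordinate (the centre)
```

The long chords are what give Hit-and-Run its advantage over local walks: in a well-rounded body a single step can cross the entire body, whereas in an elongated one most chords are short, which is exactly the geometric effect discussed next.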
The geometric structure of the feasible region has a profound effect on sampling efficiency because it determines how easily a random walk can traverse the space. Well-rounded or approximately isotropic convex bodies allow rapid mixing: the walk can move in any direction with comparable step sizes, enabling uniform exploration. In contrast, elongated, skinny, or highly ill-conditioned bodies slow mixing dramatically, as many directions generate short chords or narrow passages that trap the walk locally. Sharp corners, flat facets, and high-curvature regions further hinder methods like reflected Hamiltonian Monte Carlo or billiard walks by causing frequent reflections or numerical instability. For this reason, geometric rounding, applying an affine transformation to bring the body closer to isotropy, is often a prerequisite for efficient high-dimensional sampling. At the same time, certain methods such as Riemannian Hamiltonian Monte Carlo and the Dikin walk incorporate local curvature information, allowing them to adapt more naturally to bodies that are far from isotropic and to maintain good mixing.
Where practical sampling is concerned, the main challenge is determining when the sampler has actually converged, i.e., whether the generated points are sufficiently close to the target distribution. Theoretical mixing-time bounds offer guarantees, but the hidden constants are so enormous that the bounds become unusable in practice. As a result, practitioners rely on empirical diagnostics, most notably the Potential Scale Reduction Factor (PSRF) and the Effective Sample Size (ESS), to assess convergence and sample quality. With these tools and modern random-walk algorithms, it is now possible to obtain high-quality approximate samples from polytopes in thousands of dimensions within a few hours on moderate hardware, demonstrating that practical high-dimensional sampling is well within reach.
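As an illustration of such a diagnostic, here is a small NumPy implementation of a basic Gelman-Rubin PSRF (without chain splitting) for several chains of a scalar quantity; values close to 1 indicate good mixing. The chains below are synthetic, and dedicated statistical packages provide more refined versions of this diagnostic as well as ESS estimators.

```python
import numpy as np

def psrf(chains):
    """Potential Scale Reduction Factor (Gelman-Rubin R-hat) for an (m, n) array:
    m chains, n samples each, of one scalar quantity."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance (scaled by n)
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 2000))           # four chains from the same distribution
bad = good + np.arange(4)[:, None]          # four chains whose means disagree
print("well mixed:", round(psrf(good), 3), " poorly mixed:", round(psrf(bad), 3))
```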
Q4. High-dimensional computational geometry might seem distant from traditional database concerns, yet problems like volume estimation, integration over polytopes, and convex optimization appear in query optimization, approximate query processing, and probabilistic databases. Do you see specific opportunities where geometric algorithms could enhance modern data analytics platforms—perhaps in areas like cardinality estimation, multi-objective query optimization, or uncertainty quantification?
A. Modern data analytics platforms could potentially leverage methods from high-dimensional computational geometry to improve cardinality estimation for database queries. For example, a multi-attribute query with filters can be viewed as carving out a region (e.g., a convex polytope) in the data’s attribute space, and the query’s cardinality is essentially the volume of that region under the data’s joint distribution. In the most general case, that volume becomes an integral. Computing volumes and integrals in high dimensions is challenging, and sampling techniques based on random walks can be used for efficient approximations. In addition to practical sampling-based methods, geometric insights also offer theoretical guarantees for cardinality. An example is the Loomis–Whitney inequality, which can be used to provide worst-case size bounds for special classes of conjunctive queries. These and other geometric inequalities show that viewing query selectivity through volumes and projections can yield useful bounds for selectivity estimation.
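The following toy sketch, with made-up data and constraints, shows the sampling view of selectivity: a conjunction of linear filters over numeric attributes is treated as a polytope {x : Ax <= b}, and its selectivity is estimated from a random sample of rows together with a simple binomial error bound.

```python
import numpy as np

def estimate_selectivity(sample_rows, A, b):
    """Fraction of sampled rows falling in the query region {x : A @ x <= b},
    with a crude 95% confidence half-width from the binomial standard error."""
    inside = np.all(sample_rows @ A.T <= b, axis=1)
    p = inside.mean()
    half_width = 1.96 * np.sqrt(p * (1 - p) / len(sample_rows))
    return p, half_width

rng = np.random.default_rng(0)
# A made-up table sample with three numeric attributes (say age, income, score).
rows = rng.normal(loc=[40.0, 50.0, 0.5], scale=[12.0, 20.0, 0.2], size=(10_000, 3))

# Query region: age <= 35 AND income >= 45 AND age + 100*score <= 90, written as A x <= b.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, -1.0, 0.0],
              [1.0, 0.0, 100.0]])
b = np.array([35.0, -45.0, 90.0])

sel, err = estimate_selectivity(rows, A, b)
print(f"estimated selectivity: {sel:.3f} +/- {err:.3f}")
```

When the joint distribution is only available implicitly, the same estimate can be computed from points generated by a geometric random walk over the region, which is where the sampling machinery above comes in.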
Geometric methods are equally powerful in multi-objective query optimization, where an optimizer must balance trade-offs across multiple cost metrics (such as latency, memory usage, energy, or monetary cost). Instead of a single cost value, each query execution plan can be represented as a point in an n-dimensional cost space. The goal is to find the set of Pareto-optimal plans, those lying on the Pareto frontier where no other plan is better in all cost dimensions. This Pareto frontier often forms a curved “surface” in cost space, and computational geometry provides tools to approximate or compute it efficiently.
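A minimal sketch of the Pareto step follows; the plan names and cost vectors are invented. Given candidate plans as points in a cost space where lower is better in every dimension, it keeps only the non-dominated ones.

```python
from typing import Dict, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if it is no worse in every cost dimension and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_frontier(plans: Dict[str, Tuple[float, ...]]) -> Dict[str, Tuple[float, ...]]:
    """Keep the plans that no other plan dominates."""
    return {name: cost for name, cost in plans.items()
            if not any(dominates(other, cost) for o, other in plans.items() if o != name)}

# Invented example: (latency in ms, memory in MB, monetary cost in cents) per plan.
plans = {
    "index_nested_loop": (120.0, 200.0, 1.0),
    "hash_join":         (80.0,  900.0, 1.5),
    "sort_merge":        (95.0,  400.0, 1.2),
    "full_scan":         (300.0, 150.0, 0.8),
    "bad_plan":          (310.0, 950.0, 2.0),   # dominated, so it is filtered out
}
print(sorted(pareto_frontier(plans)))
```

Real optimizers face far larger plan spaces, which is where geometric techniques for approximating the frontier, rather than enumerating it, become valuable.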
Geometric and integration techniques could also play a crucial role in uncertainty quantification for data analytics, for example, in probabilistic databases and approximate query processing systems. In these scenarios, query results are not deterministic values but distributions or estimates with confidence intervals. Computing these requires evaluating probabilities over complex, high-dimensional spaces defined by uncertain data. This can be framed geometrically: one often needs to integrate a probability density function over a region defined by query constraints (a polytope or a semi-algebraic set in general). Weighted Model Integration (WMI) formalizes exactly this process by treating probabilistic query evaluation as the task of integrating a density over all assignments satisfying a logical formula, effectively reducing uncertainty reasoning to computing high-dimensional integrals.
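As a companion to the selectivity sketch above, the snippet below (again with made-up numbers) estimates the kind of integral that WMI ultimately produces: the mass of a probability density over a polytope, approximated by uniform Monte Carlo sampling inside a bounding box of the region.

```python
import numpy as np

def gaussian_pdf(x, mean, cov_inv, norm_const):
    """Multivariate normal density evaluated row-wise on x."""
    diff = x - mean
    return norm_const * np.exp(-0.5 * np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(0)
dim, n = 2, 200_000
mean = np.zeros(dim)
cov_inv = np.eye(dim)                        # standard normal density
norm_const = (2 * np.pi) ** (-dim / 2)

# Region defined by the "query constraints": the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])

# Bounding box of the region: [0, 1] x [0, 1]; its volume scales the estimate.
lo, hi = np.zeros(dim), np.ones(dim)
box_volume = np.prod(hi - lo)
x = lo + (hi - lo) * rng.random((n, dim))    # uniform samples in the bounding box
inside = np.all(x @ A.T <= b, axis=1)
estimate = box_volume * np.mean(gaussian_pdf(x, mean, cov_inv, norm_const) * inside)
print(f"probability mass over the region ~= {estimate:.4f}")
```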
Q5. Moving geometric algorithms from theoretical results to production database systems involves navigating trade-offs between exactness and approximation, worst-case and average-case complexity, and memory vs. CPU efficiency. Drawing from your experience with MySQL’s spatial features and research on high-dimensional algorithms, what are the most important principles for translating computational geometry research into performant, reliable database functionality that practitioners can depend on?
A. There are parts of a database engine where geometric results are logically binding: they determine which rows appear in the result, whether constraints hold, or whether an index lookup hits or misses. In those cases, even a tiny numerical error can produce an incorrect answer. For MySQL’s spatial stack, this includes operations such as ST_Intersects or ST_Contains in WHERE/JOIN clauses and the geometry used in spatial index keys. The principle here is that whenever a geometric computation influences a yes/no decision, it must be robust, consistent, and behave as though it were exact. That means implementing robust predicates (e.g., adaptive-precision orientation tests or filtering techniques) instead of relying on ad hoc epsilons. Cheap MBR-based filters and approximations are perfectly acceptable at early stages of query processing, but the final decision must pass through a kernel engineered for correctness, regression-tested on degenerate data, and stable under floating-point quirks.
In contrast, some workflows inside a DBMS are naturally approximate: cardinality estimation, cost modeling, probabilistic query answering, approximate analytics, and even certain distance-based ranking tasks. In these cases, geometric algorithms should be treated as estimators and paired with statistical guarantees. A challenge, as noted above, is that when randomness is introduced to achieve efficiency and scalability, the underlying theoretical foundations, such as mixing-time guarantees of random walks, must be translated into practical, reliable implementations. Doing so requires careful parameter selection, systematic empirical evaluation, and appropriate statistical tooling to ensure that the resulting estimators remain scalable while providing confidence measures that are both meaningful and trustworthy for the system and its users.
The third principle is to avoid over-engineering: use the simplest geometric model that is good enough for the workload, and reserve sophisticated geometric algorithms for the relatively few scenarios that genuinely require them. For most geospatial applications, for example, a spherical distance, or even a lightweight ellipsoidal approximation, is more than adequate; full geodesic series expansions on an ellipsoid, with complete edge-case handling, are only worth their cost in survey-grade or global-coverage applications. In other words, a production DBMS should keep advanced algorithms in its toolbox, but deploy them selectively, guided by the data characteristics, accuracy requirements, and performance constraints of the task at hand.
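To make the "good enough model" point concrete, the spherical great-circle (haversine) distance below fits in a dozen lines and is typically within a fraction of a percent of the true ellipsoidal geodesic distance, which is sufficient for most applications; survey-grade work would instead call a full geodesic implementation (e.g., GeographicLib-style formulas). The city coordinates are just an example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (spherical model)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points
    in degrees, on a spherical Earth model."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Athens to Thessaloniki, roughly 300 km apart.
print(f"{haversine_m(37.9838, 23.7275, 40.6401, 22.9444) / 1000:.1f} km")
```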
Q6. Anything else you wish to add?
A. One thing I would add is that computational geometry can play several complementary roles across the data-management software stack, not as a universal solution, but as a useful toolset and source of intuition. In some places, such as spatial databases or geometric indexing, it naturally serves as a backend, providing the primitives and data structures needed for correctness and performance. In other areas, geometry offers a conceptual lens that helps clarify the structure of a problem: viewing selectivity estimation as volume computation, or multi-objective optimization as navigating a high-dimensional surface. It can also function as a testbed for experimentation, where geometric algorithms inspire new heuristics, sampling methods, or approximation schemes with the potential to become practical components of query optimizers. And sometimes geometry simply provides intuition and interpretation, helping designers and researchers understand why certain approaches work well in high dimensions or why others break down. Geometry will not solve every challenge, but it can consistently provide guidance, both technically and conceptually, in the places where modern data systems need it.
References
- Bartels et al., “Fast floating-point filters for robust predicates”, BIT Numerical Mathematics, 2023. Link: https://link.springer.com/article/10.1007/s10543-023-00975-x
- Chalkis & Fisikopoulos, “volesti: Volume Approximation and Sampling for Convex Polytopes in R”, The R Journal, 2021. Link: https://journal.r-project.org/articles/RJ-2021-077/
- Chalkis et al., “Truncated Log-concave Sampling for Convex Bodies with Reflective Hamiltonian Monte Carlo”, ACM TOMS 2023. Link: https://dl.acm.org/doi/10.1145/3589505
- Cheng et al., “Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data”, VLDB 2004. Link: https://www.vldb.org/conf/2004/RS22P2.PDF
- Kettner et al., “Classroom examples of robustness problems in geometric computations”, Computational Geometry, 2008. Link: https://www.sciencedirect.com/science/article/pii/S0925772107000697
- Kolb et al., “How to Exploit Structure while Solving Weighted Model Integration Problems”, PMLR, 2020. Link: https://proceedings.mlr.press/v115/kolb20a.html
- Kook et al., “Sampling with Riemannian Hamiltonian Monte Carlo in a Constrained Space”, NeurIPS 2022. Link: https://proceedings.neurips.cc/paper_files/paper/2022/file/cdaa7f07b0c5a7803927d20aa717132e-Paper-Conference.pdf
- Papadias et al., “Processing and Optimization of Multiway Spatial Joins Using R-trees”, PODS 1999. Link: https://dl.acm.org/doi/10.1145/303976.303981
- Trummer & Koch, “Multi-Objective Parametric Query Optimization”, PVLDB 2015. Link: https://www.vldb.org/pvldb/vol8/p221-trummer.pdf
- Yang et al., “Deep Unsupervised Cardinality Estimation”, PVLDB 2020. Link: https://vldb.org/pvldb/vol13/p279-yang.pdf
……………………………………………..

Vissarion Fisikopoulos
Vissarion leads development at GeomScale, a NumFOCUS-affiliated organization advancing scalable geometric computing. His work spans computational geometry, spatial databases, statistics, and algorithm engineering, drawing on extensive experience in both academia and industry. He holds a PhD in Computer Science and has held positions at academic and research institutions, including his current appointment at the National Technical University of Athens. He spent a decade at Oracle/MySQL, contributing to major open-source projects such as the Boost C++ Libraries and MySQL.
He serves as an editor for the Journal of Open Source Software and has mentored more than twenty students through Google Summer of Code, as well as MSc and PhD programs. A regular speaker at technology and scientific conferences, Vissarion teaches graduate courses in computational geometry and algorithm engineering.