Production AI is a Data Pipeline Problem: Paul Speciale on Why Storage is the Bottleneck Nobody Plans For
Q1. Your research found that 57% of enterprises prioritize storage performance to avoid AI bottlenecks — actually ahead of compute and GPU availability. That surprises a lot of people, since the AI conversation is dominated by compute. What is actually happening inside these production environments that makes data infrastructure such a critical constraint, and why does this tend to catch organizations off guard?
A: What’s actually happening in production AI environments is that the limiting factor isn’t the availability of GPUs, but whether data can be delivered to them fast enough to keep them fully utilized. Training and inference workloads depend on complex, multi-stage data pipelines (ingestion, preprocessing, shuffling, checkpointing, and retrieval) that generate intense I/O and metadata pressure most traditional storage systems simply weren’t designed for. Our research bears this out: when we asked enterprises where they focus to prevent bottlenecks, storage performance came in at 57%, actually ahead of compute/GPU availability at 54% and network bandwidth at 52%. As GPU clusters scale rapidly, storage and data infrastructure often fail to keep pace, creating a widening imbalance where expensive GPUs sit idle waiting on data.
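To make the idle-GPU problem concrete, here is a minimal sketch (not from the interview) of the prefetching pattern production pipelines use to overlap data loading with computation. The load_batch and train_step functions are hypothetical stand-ins for real storage I/O and GPU work; when the storage layer cannot fill the queue as fast as the accelerator drains it, the compute step simply stalls.

```python
# Minimal prefetching sketch: a background thread reads batches ahead of the
# compute loop so storage latency overlaps with computation instead of
# serializing with it. load_batch() and train_step() are hypothetical stand-ins.
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.05)          # simulate storage / preprocessing latency
    return f"batch-{i}"

def train_step(batch):
    time.sleep(0.05)          # simulate GPU compute time
    print(f"trained on {batch}")

def prefetch(num_batches, q):
    for i in range(num_batches):
        q.put(load_batch(i))  # blocks when the queue is full (backpressure)
    q.put(None)               # sentinel: no more data

q = queue.Queue(maxsize=8)    # bounded queue depth = how far ahead we read
threading.Thread(target=prefetch, args=(16, q), daemon=True).start()

while (batch := q.get()) is not None:
    train_step(batch)         # stalls here whenever the queue runs dry
```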
Organizations get caught off guard because early AI planning tends to focus on the obvious constraints (GPUs and model architecture) while underestimating how central data movement becomes at production scale. Metadata handling is a particular blind spot. Our research found it was the area least likely to be addressed during initial planning, and only 29% of newer adopters are highly focused on it. By the time teams realize the bottleneck is in the data layer rather than the compute layer, they’re already dealing with performance problems that are much harder to fix retroactively.
Q2. The research shows that 81% of enterprises consider private AI infrastructure critical to their success, driven by sovereignty, compliance, and data proximity. But building and maintaining private AI infrastructure is not cheap or simple. How should organizations think about the trade-offs between private AI and cloud-based AI services — and what does it actually take to make private AI work at scale?
A: Organizations should think about private AI versus cloud AI as a trade-off between control and compliance on one side, and speed and elasticity on the other. Our research shows that 81% of enterprises now see private AI infrastructure as critical, driven by sovereignty, compliance, and the need to run AI applications in close proximity to the on-premises data they depend on. In practice, the data shows that a pragmatic hybrid approach is the most common outcome: 38% of respondents described their AI data location as hybrid, with cloud still playing a role even among committed private AI adopters.
What makes private AI work at scale is less about raw spending and more about avoiding a piecemeal approach. One of the strongest themes in the research is the danger of reactive, project-by-project infrastructure decisions that accumulate technical debt and dead-end investments. The organizations getting this right are settling on a small number of versatile architectural options early (storage platforms that can be configured for different performance and resilience profiles across the AI pipeline) rather than procuring specialist point solutions for each new project. That architectural consistency is what separates teams that scale smoothly from those that hit a wall. The barrier to entry has also come down considerably as GPU price/performance has improved, making private AI viable for a much broader set of organizations than even two years ago.
Q3. Object storage underpins 91% of private AI deployments in production — a striking number. Yet many IT professionals still associate object storage primarily with backup and archiving rather than active AI pipelines. What has changed technically and architecturally that makes object storage so central to production AI, and what do infrastructure teams need to understand about it that they probably don’t yet?
A: What has changed is that AI has turned object storage from a passive archive layer into the backbone of active data pipelines. As our research findings show, 91% of private AI deployments in production now rely on object storage. The reason is architectural: modern AI workloads involve continuous movement of large, unstructured datasets across training, fine-tuning, and inference stages, and storage is now disaggregated from compute and designed to serve distributed AI systems via APIs like S3 rather than traditional file-based access. That makes object storage extremely well suited to the scalability, metadata-rich workflows, and hybrid environments that enterprise AI demands.
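As a rough illustration of what "access via the S3 API rather than a mounted file system" looks like in practice, here is a minimal sketch using boto3; the endpoint URL, bucket, and object key are placeholders for any S3-compatible store, not details from the interview.

```python
# Sketch: reading a training shard directly over the S3 API with boto3.
# Endpoint, bucket, and key are placeholders for any S3-compatible store.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.example.com",  # S3-compatible endpoint
)

# Stream one shard of a training dataset; no file system mount involved.
resp = s3.get_object(Bucket="training-data", Key="datasets/v3/shard-0001.parquet")
shard_bytes = resp["Body"].read()
print(f"read {len(shard_bytes)} bytes, etag={resp['ETag']}")
```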
Q4. The data shows that organizations are converging on tiered, hybrid architectures rather than greenfield deployments — with 44% adapting existing compute and 42% adapting existing storage for AI. In practice, what are the biggest mistakes organizations make when trying to retrofit existing infrastructure for AI workloads, and what separates the teams that do this successfully from those that struggle?
A: Our research made clear that most organizations are retrofitting rather than rebuilding, with 44% adapting existing compute and 42% adapting existing storage for AI workloads. That’s a very rational starting point, but the biggest mistake we see is treating AI as a bolt-on to legacy infrastructure rather than rethinking data flows for the continuous, high-throughput pipelines that AI requires. Teams frequently underestimate how quickly traditional file-based or siloed storage becomes a bottleneck once workloads scale beyond initial pilots.
The research gives us a clear picture of what separates successful teams. Seasoned adopters (those with significant breadth and depth of AI experience) are far more likely to purpose-build their storage infrastructure: 56% versus just 28% of newer adopters. More importantly, they define storage requirements during initial project planning rather than discovering them through trial and error after deployment. Among seasoned adopters, 69% finalize storage requirements upfront, compared to only 40% of new starters. The teams that succeed also adopt tiered, hybrid architectures and decouple compute from storage to support different AI access patterns. They invest early in data engineering and pipeline orchestration, not just compute scaling. In contrast, struggling teams pour resources into GPUs while leaving the underlying data infrastructure largely unchanged, and then wonder why performance doesn’t improve proportionally.
Q5. Metadata handling at scale and mixed workload management are flagged as significant bottleneck risks. These are not glamorous topics, but they seem to be where production AI actually breaks down. Can you walk us through what these challenges look like in the real world — and what good infrastructure design looks like to address them before they become a problem?
A: Not glamorous indeed, but our research confirms they’re where real production problems emerge. At scale, tracking datasets, model versions, and lineage across distributed pipelines becomes a serious governance and performance challenge. What makes metadata particularly treacherous is that it’s the area most likely to be neglected during planning. Our research found that even among seasoned adopters, 52% only address metadata handling after early deployment experience rather than planning for it upfront. For newer teams, the gap is even wider.
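One simple way to treat metadata as part of the pipeline rather than an afterthought is to attach lineage information to objects at write time. The sketch below, with placeholder bucket, key, and metadata names of our own, shows how user-defined metadata on an S3-compatible store can carry dataset version and provenance alongside the artifact itself.

```python
# Sketch: attaching lineage metadata to objects at write time so dataset and
# model versions can be traced later. All names here are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

s3.put_object(
    Bucket="models",
    Key="checkpoints/llm-finetune/step-12000.pt",
    Body=open("step-12000.pt", "rb"),   # local checkpoint file (placeholder)
    Metadata={                          # stored as x-amz-meta-* headers
        "dataset-version": "v3",
        "source-commit": "9f2c1ab",
        "training-run": "run-2024-07-18",
    },
)

# Later, lineage can be recovered with a cheap metadata-only call.
head = s3.head_object(Bucket="models", Key="checkpoints/llm-finetune/step-12000.pt")
print(head["Metadata"])                 # {'dataset-version': 'v3', ...}
```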
Mixed workload contention is equally concrete. Model training demands high-throughput sequential access to massive datasets, while runtime inference requires low-latency random access to model parameters and reference data. Production environments typically run both simultaneously alongside data preprocessing, creating fundamentally different demands on the same infrastructure. Without proper isolation or prioritization, these workloads interfere with each other in ways that are difficult to diagnose.
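The contrast between those two access patterns is easy to see at the API level. This is a sketch against placeholder objects, not a recommendation of any particular layout: training reads whole shards sequentially, while a latency-sensitive consumer pulls a small byte range from a large artifact instead of downloading all of it.

```python
# Sketch of the two access patterns described above, against placeholder objects:
# training reads whole objects sequentially; inference pulls small byte ranges.
import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

# Training: high-throughput sequential read of a full shard.
shard = s3.get_object(Bucket="training-data", Key="shards/shard-0001.bin")
data = shard["Body"].read()

# Inference-style access: low-latency random read of a small slice of a large
# artifact, using an HTTP range request instead of fetching the whole object.
slice_ = s3.get_object(
    Bucket="models",
    Key="embeddings/index.bin",
    Range="bytes=1048576-1049599",   # 1 KiB starting at the 1 MiB offset
)
chunk = slice_["Body"].read()
print(len(data), len(chunk))
```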
Good infrastructure design treats metadata as a first-class capability from the outset, enabling efficient indexing, search, and lifecycle management across the storage layer. It also separates and orchestrates workloads using tiered or policy-driven architectures so that different AI tasks get the performance profiles they need without starving one another. The key principle is designing for observability and control upfront, rather than trying to retrofit solutions once systems are already under pressure. The research makes clear that organizations that invest in this kind of architectural thinking early end up with significantly greater confidence in their infrastructure’s ability to meet both current and future demands.
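As one narrow example of a policy-driven control, the sketch below sets a lifecycle rule on an S3-compatible bucket so that cold pipeline output is tiered automatically. Bucket, prefix, and storage-class names are placeholders; the broader workload isolation and QoS discussed above is typically configured at the platform level rather than through a single API call.

```python
# Sketch: a lifecycle rule that tiers cold checkpoint data to a cheaper storage
# class after 30 days. Bucket, prefix, and storage class are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

s3.put_bucket_lifecycle_configuration(
    Bucket="pipeline-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-checkpoints",
                "Status": "Enabled",
                "Filter": {"Prefix": "checkpoints/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```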

Paul Speciale, CMO, Scality
Over 20 years of experience in technology marketing and product management. A key team member at four high-profile startup companies and two Fortune 500 companies.
Sponsored by Scality