Beyond the Benchmark: Reva Schwartz on Measuring AI’s Real-World Impact.
Q1. Reva, you’ve spent two decades evaluating automated technologies in high-stakes settings — from your work as a forensic scientist at the U.S. Secret Service to leading NIST’s AI risk management efforts. How has your understanding of AI’s risks to people and society evolved over this time, and what patterns have you observed in how organizations actually use AI versus how they think they’re using it?
A: Over time, I’ve watched AI move from tightly controlled, specialized settings to a world where people ask chatbots about almost anything. That shift has made one pattern very clear: organizations often treat data and technology as a cleaner, more objective alternative to grappling with their long‑term human and organizational challenges.
Instead of being modeled in context, those factors are typically swapped out for an abstraction layer of large‑scale datasets and tools inserted into workflows. Of course, taking people out of view doesn’t make the real‑world challenges go away; it just makes them harder to see and more complex to address. The irony is that working on those human and organizational pieces is often more tractable than people assume—it just draws on a different kind of expertise than building or buying more technology.
As with almost any other technology, AI’s most challenging risks come less from what the tools can do and more from how people choose to leverage them. That space is massively heterogeneous, and we will never predict every configuration in advance—nor do we need to. But we do need to bring people, context, and organizational dynamics back into our models of AI if we want to understand and manage its real impacts.
Q2. You were chief architect of NIST’s AI RMF Playbook and founded the Assessing Risks and Impacts of AI (ARIA) program. From your experience working with diverse organizations implementing the AI Risk Management Framework, what are the most common gaps between understanding AI risks conceptually and being able to measure and manage them practically? What does it actually take to move from framework adoption to meaningful risk mitigation?
A: One persistent gap is how we talk about “risk” itself. We often perceive risk only as something to eliminate, when historically “taking a risk” also meant opening the door to innovation. In today’s AI conversations, organizations focus so much on hunting for negative impacts that they miss the chance to ask what these tools actually mean in their own setting.
A major issue is that we often view risk through the wrong end of the telescope. We focus heavily on the technical properties of models and under‑invest in understanding the broader sociotechnical system those models sit inside. Organizations sometimes talk as if AI systems are external forces acting on them, rather than tools they configure, constrain, and are accountable for. Moving from framework adoption to real risk mitigation means taking that control back and treating these systems as something that can be measured and adjusted in your own context—not just something you tune with model metrics.
That work is pretty “old school”: talking across functions, mapping workflows, understanding handoffs, and building shared language about desired outcomes and acceptable trade‑offs. AI models can absolutely support that work, but they can’t do it for you. In the end, meaningful risk management is people staying in the driver’s seat—using AI frameworks to learn from their own deployments and bring the whole team into decisions about how, where, and whether to use these systems.
Q3. Your work uniquely combines measurement science, social science, and machine learning — disciplines that often operate in separate silos. Through Civitaas Insights, you’re developing evaluation methods to understand how AI reshapes culture and society in the real world. Can you share examples of AI impacts that traditional technical evaluations miss entirely, and how your multidisciplinary approach reveals risks or opportunities that purely technical or purely social science perspectives would overlook?
A: The “what” we measure is tightly linked to how we measure. Over‑reliance is a good example. Many people worry about AI causing cognitive skills to atrophy—which is a higher‑order effect—but most evaluations are stuck in “primary‑effects land,” assessing the suitability of system responses to canned prompts.
These approaches essentially “give the system a survey” and score its output with blunt instruments like keyword matching and simple accuracy labels. With attention locked on model outputs, evaluation misses what happens next—what people actually do with those outputs. Do they accept them without question, double‑check them, or ignore them? For a real‑world risk like over‑reliance, those questions matter at least as much as response accuracy or quality, because they show how these technologies shape behavior and consequences over time.
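To make that contrast concrete, here is a minimal Python sketch (purely illustrative, with hypothetical names rather than any specific benchmark or evaluation harness) of the difference between scoring a canned response with keyword matching and logging the higher‑order signal of what a person actually did with the output:

```python
# Illustrative sketch only; hypothetical names, not any particular evaluation tool.
# It contrasts a "primary-effects" check (scoring the model's answer against expected
# keywords) with recording the downstream signal: what the person did with the answer.

from dataclasses import dataclass
from enum import Enum


def keyword_accuracy(response: str, required_keywords: list[str]) -> float:
    """Blunt primary-effect score: fraction of expected keywords present in the output."""
    if not required_keywords:
        return 0.0
    hits = sum(1 for kw in required_keywords if kw.lower() in response.lower())
    return hits / len(required_keywords)


class UserAction(Enum):
    ACCEPTED_UNCHECKED = "accepted_unchecked"  # copied the answer without verification
    VERIFIED_THEN_USED = "verified_then_used"  # double-checked against another source
    IGNORED = "ignored"                        # discarded the output


@dataclass
class InteractionRecord:
    """One logged interaction: the output score plus what the user actually did next."""
    prompt: str
    response_score: float
    user_action: UserAction


def over_reliance_rate(records: list[InteractionRecord]) -> float:
    """Share of interactions where the output was accepted without any checking --
    a downstream signal that output-only benchmarks never see."""
    if not records:
        return 0.0
    unchecked = sum(1 for r in records if r.user_action is UserAction.ACCEPTED_UNCHECKED)
    return unchecked / len(records)
```

The particular metric is beside the point; the second kind of signal only exists if the evaluation follows the output into real use.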
Another risk is hallucinations. Their presence is a primary effect, but what really matters is the downstream impact—especially when it contributes to harm. The stakes are very different if people blindly copy a hallucinated answer versus noticing something looks off and adjusting their behavior. Benchmarks don’t capture those real‑world user actions and workarounds, so they miss the difference between systems that quietly train people to over‑rely and systems that nudge people to stay alert.
That’s where combining measurement science, social science, and ML matters. It lets us design evaluations that track not just static model outputs, but what actually happens when real people use these tools in dynamic settings. This produces the kind of information that organizations need to make informed decisions about their AI deployments.
Q4. Bias in AI has become a major focus area, but the term itself means different things to different stakeholders. Based on your research leading NIST’s work on bias in AI and advising organizations through VernacuLab, how should organizations think about AI bias in ways that go beyond fairness metrics and technical definitions? What practical advice would you give to companies struggling to translate bias concerns into actionable evaluation and governance strategies?
A: There are two high-level points I always try to get across. First, bias itself isn’t automatically bad, just like any AI risk isn’t automatically bad. Context matters. Many AI systems need some form of bias to be useful at all—a recommender that doesn’t prioritize, filter, and “lean” toward some options over others would just be noise. What we care about are harmful biases: patterns that systematically disadvantage people or groups, create discrimination, or distort decisions in ways that conflict with an organization’s values or obligations.
Second, bias is not only a computational phenomenon; it can originate from humans, from the institutions around us, and from the computational layer, and those three continually interact. This is why a sociotechnical frame is so important. It keeps us from flattening “bias” into a single disparate-impact statistic and instead forces us to think about context: biased in what direction, for whom, in which setting, and with what consequences? Only with that contextual knowledge can we tell whether a bias is creating harm—and, if it is, which levers to use to mitigate it, whether that means changing data and models, adjusting policies and workflows, or rethinking how people interact with the system.
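As a purely illustrative aside (hypothetical numbers, not drawn from any real system), this is roughly what such a single disparate‑impact statistic looks like, and why it answers so few of those contextual questions on its own:

```python
# Illustrative only: the kind of single disparate-impact statistic that can flatten context.
# selection_rates maps a group to the share of its members who received a favorable
# outcome (e.g., resumes advanced by a screening model). Numbers are made up.

selection_rates = {"group_a": 0.40, "group_b": 0.28}

# Disparate impact ratio: lowest group selection rate divided by the highest.
ratio = min(selection_rates.values()) / max(selection_rates.values())
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.70 with these hypothetical numbers

# A common rule of thumb flags ratios below 0.8, but the number alone says nothing
# about direction, setting, or consequence -- the contextual questions raised above.
```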
The practical shift is to treat bias less as a one‑time technical fix and more as an ongoing organizational practice. When companies do that, and use evaluation methods that tie measurements back to their actual workflows, they can move from worrying about “AI bias” in the abstract to having concrete levers they can adjust and monitor over the life of a system. The good news is that nobody knows their context better than the people inside the organization, so applying a risk‑aware culture to bias is ultimately about equipping and empowering them to see, debate, and act on these issues in their own setting.
Q5. Looking at the current trajectory of AI development and deployment — from generative AI to increasingly autonomous systems — where do you see the biggest disconnect between the pace of AI innovation and our ability to assess its value, utility, and societal impacts? What measurement science capabilities or evaluation methodologies do we urgently need to develop to ensure AI systems actually serve human and societal wellbeing rather than just optimizing narrow technical objectives?
A: When I think about how we measure AI right now, I keep coming back to the word “mismatch.” The biggest mismatch is between the wide set of questions people actually have about AI and the narrow set of questions we currently evaluate. When technologies get as woven into everyday life as AI is now—think elevators, airplanes, or the internet—how we test them usually broadens too. We’ve hit the point with AI where we can move beyond assessments that answer only the questions on the minds of the people who build the technology and start exploring the questions everyone else has about what AI means in their lives, for better and for worse.
Because that shift hasn’t really happened yet, we’ve ended up with an evaluation ecosystem that doesn’t give decision makers what they need. Organizations are trying to decide where to invest, what to deploy, and how to manage risk, but most of the available evidence is still about model capabilities, not real‑world impact. They’re left guessing about why adoption is slow, why value is lagging, and which interventions actually help.
My current work is designed to help fill this gap. I focus on building methods and tooling for real‑world AI evaluation that bring people directly into the equation. These are large‑scale, longitudinal evaluations that connect system behavior with how people actually leverage these tools, the contexts in which they’re used, and the outcomes they produce. This offers decision makers the kinds of insights they actually need—where AI adds value and where it falls short. With that kind of evidence, we no longer have to guess which technologies create good outcomes and which don’t—we can see it, measure it, and act on it.
Qx. Anything else you wish to add?
A: One thing I’d add is that none of this happens without a much more interdisciplinary ecosystem. Right now, too much of AI evaluation lives in a handful of technical environments, and that’s part of why we get such a narrow view of value and risk. The ecosystem we need pulls in social scientists, domain experts, communities, designers, and policymakers alongside ML folks, so we’re co‑designing questions, methods, and metrics with the people who actually live with the technology. If AI is going to reshape whole institutions and sectors, then the people who are closest to the deployment context have to be in the room helping decide what “good” looks like and how we measure it.
……………………………………………………………………..

Reva Schwartz, Co-Founder, Civitaas | Research Scientist
Washington, District of Columbia, United States