Trust Is Not a Feeling: Nuno Galante Valério on Engineering Accountability into AI for High-Stakes Healthcare
“The way most AI conversations use ‘trust,’ it names a feeling – and you can’t engineer a feeling.”
Q1. What do the builders of AI consistently fail to understand about deploying their work in a GxP environment, where the cost of being wrong is measured in patient safety?
Nuno Galante Valério: If I have to choose one thing: they don’t feel the distance between a demo that works and a system you can deploy. That distance is the entire job, where the whole effort is. It’s where I’ve spent my career.
I’ve sat through this meeting many times: a vendor, or one of our own teams, shows me something that genuinely impresses the room. The model reads a batch record, finds the deviation, drafts the CAPA, and does it faster and more carefully than the person who used to. Someone says the word “production-ready,” and means it. So, I ask them to run it again, same input. They do, and what comes back is almost the same. A sentence in a different order. A risk worded a little differently. A reference that was there the first time and, the second time, quietly isn’t. The mood in the room changes, because everyone understands at once that “almost the same” is not something you can write into a validation report and put your name under.
Now, the easy lesson to draw from that room is the wrong one – that generative systems are too unstable to let near anything that matters. Europe’s first instinct, in its draft guidance for AI in manufacturing, was close to that: keep these models away from critical operations. The part I find genuinely interesting is that the direction is already moving off it, toward a risk-based view, and I think that correction is right. It turns on a distinction the builders almost never start from: risk is a property of the function, not of the technology. A frozen, deterministic model making a release decision with nobody checking it is more dangerous than a probabilistic one drafting something a qualified person reviews before it goes anywhere. The variation I provoked in the room was never the hazard; the hazard is letting any output, stable or not, reach a place you can’t walk it back from, without a control built to catch it. It’s why, when my team sizes up an AI use, the first questions aren’t about the model at all – they’re how critical the function is, how much the thing decides on its own, and whether we’d even notice it going wrong.
Here is what the builders are actually missing, and they miss it because everything in their world rewards them for missing it. They optimize for capability – can the system do the task, well, fast. The regulated world doesn’t start there. It starts somewhere stranger: can you tell me, in advance and in writing, the edge of what this thing will do, so that inside the edge I’m never surprised, and outside it I can prove I had something in place to catch it. And the failure that keeps me awake isn’t the one the demo shows. The demo shows what the model catches. I’m paid to worry about what it misses, because a miss in my world doesn’t raise its hand. A false alarm announces itself and someone investigates; a missed signal just sits there, looking like nothing happened.
So, the failure isn’t really technical. Most of these people are far better engineers than I’ll ever be. What they haven’t done, what they’ve never been asked to do, is be the person whose name goes on the line that says I am accountable for what this does in front of a patient, and for what it fails to do. If you’ve never had to sign that, “it works” feels like the finish. Once you have, “it works” is maybe halfway, and the easy half. The other half has no demo in it. It’s building the argument for why the risk that remains is acceptable, and then defending that argument to an inspector whose job is to assume you got it wrong.
I don’t say this to be hard on them; you can’t really know it until you’ve lived it. I say it because the most interesting work in the field right now is sitting in that gap, between “it works” and “I’d stake my name on it,” and almost nobody upstream has noticed the gap is even there.
Q2. Give us a concrete example where the governance process was itself the site of genuine innovation – where something was invented that would not have existed without it.
Nuno Galante Valério: The honest, real version of this starts with a failure, because the useful thing came out of the failure itself.
We had a system – document-grounded, retrieval-based, the kind that answers a quality question by pulling from a controlled procedure corpus rather than from the model’s own memory. By every measure we had, it passed. Retrieval was solid, the prompts frozen, the version pinned, the test cases green. The validation evidence was complete. And as the process owner, I wouldn’t give my sign-off. Not because I could point at a defect (I couldn’t, the validation was clean) but because “the protocol passed” and “I’ll stand behind this running in my process for the next eighteen months” are not the same statement, and the second one is what my signature actually carries.
Sitting in that gap is what recently sent me to Petri Pohjanen. He’d spent years in automotive functional safety – ISO 26262, the world where software steers a moving car and a wrong output is a crash (not a typo) – and he’d held release authority, so he had personally signed the kind of statement I was hesitating over. Automotive had already solved, twenty years ago, a version of the exact thing I was stuck on: how do you take responsibility for a system you can’t test exhaustively? Their answer was never to make it deterministic. It was the safety case: a structured, layered argument that the risk of failure that remains is low enough to accept, with evidence under each layer. I’d been trying to discharge, with a test report, something that was only ever going to yield to an argument.
What came out of working together we called the Layered Assurance Stack, work that Petri and I are still developing in the open. Three layers that deliberately don’t collapse into one another. The first is what the system is allowed to do in the first place. The second is how it can fail in ways that have nothing to do with a broken component (this is where automotive’s SOTIF thinking carries over, the failures that come not from a part breaking but from the system meeting a situation outside the assumptions it was designed around). The third is what has to exist inside the organization to catch those failures while it’s running. Run the three together, and you get a proportionality result: how much assurance this particular use, in this particular context, actually needs. We gave the result a name and a set of tiers, but honestly the name is the least interesting part of it. The moment you name a tier, people start treating it as a standard instead of as the answer to a question, and the thinking stops.
Here’s the part that wouldn’t exist without the governance problem forcing it. What pharma was missing was never a better test. It was a language for arguing about probabilistic systems that an auditor can actually follow. The field had two reflexes: set the temperature to zero, and pretend you’ve made the thing deterministic; or refuse to deploy at all – and both are answers to a question nobody should be asking.
There was nothing in between, so we had to build the in-between. And the only reason we could is that I’d hit a wall where my existing tools told me a system was fine and my own judgment told me it wasn’t, and I refused to settle that by trusting the tools over the judgment.
The cost of it, since this series is about honesty and not press releases: it’s slow. It needs a certain organizational maturity. It needs you to disagree, sometimes sharply, with people you respect. And it needs patience to build at any real scale. Right now, the vocabulary is further along than the adoption. Closing that distance is the part still in front of me and many of my peers.
Q3. ICH E6(R3) and the broader GxP framework assume deterministic, validated software. Generative AI is probabilistic and non-deterministic. How are you and your peers actually handling that tension in practice – not in principle?
Nuno Galante Valério: In practice, it gets handled by moving what you validate, which is a far quieter answer than the public debate would suggest.
The initial instinct is to ask how you validate the model. That question has no good answer, because the model is the part that won’t hold still. So, the people doing this seriously validate something else: the process made of a human and a system together, with the model sitting inside a control envelope as one component, rather than being the thing on trial. You don’t qualify the language model. You qualify the workflow around it – a person of defined competence reviewing the output against a defined standard, with the boundaries written down and the failure modes named before you start. The model is allowed to be probabilistic, as long as the process containing it is controlled. And that isn’t a dodge. It’s the same move we’ve always made with people: we never validated the analyst’s mind; we validated the procedure the analyst worked inside, because the analyst was fallible too and we knew it.
The second shift is harder, and it’s the one that is really unsettling – the move from validating at a point in time to monitoring over time. Classical validation works as a photograph. You show the system was right on the day you tested it, then you freeze it. But there’s a thread in the interpretability research, Anthropic’s included, about the gap between the reasoning a model states and the computation it actually performs. Take that seriously enough, and the photograph stops meaning much. If a system can drift, and if the reasons it gives you aren’t reliably the reasons it acted on, then proving it was correct on day one tells you very little about day two hundred. Validation has to become something closer to surveillance. You’re not proving correctness once; you’re sampling for it, continuously, against a population of inputs that keeps moving under you.
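To make “sampling for it, continuously” a little more concrete, here is a minimal sketch of one way such monitoring could look. It is an illustration under stated assumptions, not a method described in the interview: the class name `RollingReviewMonitor`, the window size, and the acceptable error rate are all hypothetical, and the Wilson score bound is simply one common choice for putting an upper limit on an observed error proportion.

```python
import math
from collections import deque


def wilson_upper(errors: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for an error proportion."""
    if n == 0:
        return 1.0  # no reviewed samples yet: assume the worst
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom


class RollingReviewMonitor:
    """Hypothetical monitor over the most recent human-reviewed outputs."""

    def __init__(self, window: int, max_error_rate: float):
        # Only the latest `window` review results count, so the evidence
        # follows the input population as it moves.
        self.results = deque(maxlen=window)
        self.max_error_rate = max_error_rate

    def record(self, reviewer_found_error: bool) -> None:
        self.results.append(reviewer_found_error)

    def within_bounds(self) -> bool:
        # "Within bounds" is only claimed when the upper bound on the
        # error rate sits at or below the agreed limit.
        n = len(self.results)
        errors = sum(self.results)
        return wilson_upper(errors, n) <= self.max_error_rate
```

The design choice worth noticing is that the default state is “not demonstrated”: with no reviewed samples the monitor reports out of bounds, and it only reports in bounds once recent evidence supports it.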
That points at a role with no name yet, which I think is the single most important unbuilt thing in the field. Some hybrid of quality assurance and data science – a person who can read a control chart and a model card with equal fluency, who watches a production AI system the way a process engineer watches a control strategy. That person isn’t on the pharma org chart yet. The data scientists rarely think in GxP (actually, they often avoid it) and the quality people rarely think in distributions, so whoever holds both frames at once has usually arrived there by accident. Somebody is going to have to build that into a profession on purpose.
So, the honest answer to “how are you handling it”: imperfectly, and by learning as we go. The frameworks haven’t caught up, so for now it’s people building the bridge while they’re standing on it. Uncomfortable. It’s also, I’d argue, the fastest way to find out what the bridge actually has to carry.
Q4. You lead a “trust architecture” for AI in GxP. What does trust actually mean as an engineering requirement – how do you decompose it into properties that can be specified, tested, monitored, and maintained?
Nuno Galante Valério: I’d start by taking the word back from itself, because the way most AI conversations use “trust,” it names a feeling – and you can’t engineer a feeling. What you can engineer are the conditions that make the feeling unnecessary. A patient swallows a tablet without auditing the supply chain behind it. Not because they’ve decided to believe in it, but because a century of architecture has already absorbed the complexity, so they don’t have to. That absorbed, invisible structure is what trust actually is, once you stop treating it as an emotion. And notice where it lives: not in the tablet, but in everything standing behind it. With AI it’s the same, and it’s the whole reason I named the work the way I did: the trust that matters was never going to live inside the model. It lives in what you build around it.
So, the question I work on is: what does a system have to do, structurally, before it earns that kind of invisibility? Looking across pharmaceutical regulation, aviation, banking, nuclear, food safety, the blood supply, the machinery of courts and professions – seven functions kept reappearing. Not because they’re the only things present in any one regime, but because their absence is what turns up in the post-mortem whenever trust collapses. Thalidomide was a surveillance failure. The 2008 financial crisis was a failure of provenance and verifiability. Tuskegee – men left untreated for a disease that had a cure – was a failure of recourse. Each one fails in its own characteristic way, and the mature version of every trust regime is, if you look closely, the scar tissue from once having been missing that function.
The seven are provenance, verifiability, accountability, reversibility, legibility, recourse, and surveillance. Rather than march through all seven, I’ll share how they group, because the grouping is what does the work. Provenance and verifiability are the is-it-what-it-claims pair: can you trace every component to its origin, and can someone not aligned with the maker check the claims independently. For most production AI in 2026, the honest answer to both is “not really” – we often cannot say who labelled the training data, or under what consent, and frontier evaluation is largely self-reported by the lab that trained the model, on benchmarks it partly designed. Accountability and reversibility are the can-it-be-answered-for-and-undone pair. Legibility and recourse are the can-the-affected-human-see-it-and-get-a-remedy pair. And surveillance stands alone – the population-level function that catches the slow, aggregate harm that no single user would ever notice in themselves.
People ask why seven, and not five or nine. Because seven is the smallest set that survives a comparative test. Drop one and you find you’ve fused two functions that do genuinely different jobs; add one and you’ve split a function into halves that were never really independent. I’m not claiming it’s the only taxonomy anyone could draw. I’m just claiming you can’t remove a piece without losing something you needed, or add one without repeating yourself. That’s a falsifiable claim, which is the most I can honestly offer – and I’d be glad to be proven wrong.
Where it gets interesting is that GxP doesn’t weight the seven evenly. The three that pharma tends to underbuild are, awkwardly, the three that decide whether AI is deployable at all.
Surveillance is the one the non-determinism question kept circling. Point-in-time qualification is just a photograph; a system that can drift needs continuous monitoring against a population that moves. Pharma already knows how to do this for drugs – it’s called pharmacovigilance. It just hasn’t started doing it for models.
Reversibility almost nobody builds, and in a regulated setting it’s unforgiving, because so many of the actions an AI touches can’t be taken back. You can recall a batch. You cannot easily un-make a decision that’s already propagated into a regulatory submission or a patient’s record. So, reversibility here is less an “undo button” and more a question: “Is there a containment boundary that catches a wrong output before it becomes irreversible?” That’s a design property, it costs money, and it’s usually the first to be cut when a team is chasing capability.
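As a rough picture of what a containment boundary can mean in software terms, the sketch below holds every generated output in a reversible state until a qualified reviewer releases it. Everything here is an assumption made for illustration: `ContainmentGate`, `Draft`, and the reviewer identifiers are hypothetical names, and a real system would also need audit trails and access control that this sketch leaves out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Set


class Status(Enum):
    HELD = "held"          # generated, not yet allowed anywhere irreversible
    RELEASED = "released"  # approved by a qualified reviewer
    REJECTED = "rejected"  # stopped inside the boundary


@dataclass
class Draft:
    draft_id: str
    content: str
    status: Status = Status.HELD
    reviewer: Optional[str] = None


class ContainmentGate:
    """Hypothetical gate between model output and irreversible actions."""

    def __init__(self, qualified_reviewers: Set[str]):
        self.qualified_reviewers = set(qualified_reviewers)
        self.drafts: Dict[str, Draft] = {}

    def submit(self, draft_id: str, content: str) -> Draft:
        # Every model output enters as HELD; nothing is released by default.
        draft = Draft(draft_id, content)
        self.drafts[draft_id] = draft
        return draft

    def _decide(self, draft_id: str, reviewer: str, outcome: Status) -> Draft:
        if reviewer not in self.qualified_reviewers:
            raise PermissionError(f"{reviewer} is not a qualified reviewer")
        draft = self.drafts[draft_id]
        if draft.status is not Status.HELD:
            raise ValueError(f"{draft_id} has already been decided")
        draft.status = outcome
        draft.reviewer = reviewer
        return draft

    def approve(self, draft_id: str, reviewer: str) -> Draft:
        return self._decide(draft_id, reviewer, Status.RELEASED)

    def reject(self, draft_id: str, reviewer: str) -> Draft:
        return self._decide(draft_id, reviewer, Status.REJECTED)

    def released(self) -> List[Draft]:
        # Only these drafts may be passed on to a submission or record.
        return [d for d in self.drafts.values() if d.status is Status.RELEASED]
```

The point of the shape is that the irreversible step can only read from `released()`, so a wrong output has to pass a named person before it can propagate.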
And recourse is the one the engineering-minded want to leave out, and the one I can’t let them drop. When the system is wrong about something that matters, is there a path for the human to remediate it? A system can be perfectly provenanced, verifiable, accountable, reversible, legible, and surveilled, and still be untrustworthy if being wrong about you comes with no remedy. Recourse is the function that remembers there is a person at the end of all this, not just a number or metric. It’s also the one with no clean home in most architectures, which is exactly why it goes missing.
Decomposed this way, trust stops being a vibe in a vendor pitch (that truly doesn’t help anyone) and becomes a set of functions you can specify, assign owners to, test against, and audit. The work of a trust architecture is exactly that translation – taking a word everyone nods at (and instinctively understands), and turning it into seven things someone has to be accountable for. The moment trust has an owner and a test, it isn’t a feeling anymore. It’s engineering.
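One small way to picture “an owner and a test” for each function is a registry that reports which of the seven still lack one or the other. This is a hedged sketch rather than part of the Trust Architecture framework itself: the `Assignment` structure, the `gaps` helper, and the example owner and test names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The seven functions as named in the interview.
TRUST_FUNCTIONS = (
    "provenance",
    "verifiability",
    "accountability",
    "reversibility",
    "legibility",
    "recourse",
    "surveillance",
)


@dataclass
class Assignment:
    owner: Optional[str] = None  # accountable person or role
    test: Optional[str] = None   # reference to a test or audit procedure


def gaps(assignments: Dict[str, Assignment]) -> Dict[str, List[str]]:
    """Return, per trust function, which of owner/test is still missing."""
    missing: Dict[str, List[str]] = {}
    for fn in TRUST_FUNCTIONS:
        a = assignments.get(fn, Assignment())
        lacking = [
            name
            for name, value in (("owner", a.owner), ("test", a.test))
            if not value
        ]
        if lacking:
            missing[fn] = lacking
    return missing
```

For example, a registry that assigns an owner to surveillance but no test would show up as `{"surveillance": ["test"]}` alongside the six functions nobody has picked up yet.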
Q5. Cerf, Kay, Stroustrup, Booch built foundations others stand on. You’re building the governance and trust infrastructure that decides whether AI can stand on those foundations in one of the highest-stakes domains there is. Looking at the next decade – what needs to be built that doesn’t yet exist, without which the most important AI applications in medicine simply won’t be deployable at scale?
Nuno Galante Valério: Two things. The second is much harder than the first, and almost no one is working on it.
The first is a regulatory science that can reason about distributions, not just instances. Our whole evidentiary tradition rests on the qualified instance: this system, tested, frozen, proven. What we need is a science that knows how to accept evidence of a different shape: this system stays within acceptable bounds, across a whole population of inputs, monitored continuously, with these statistical guarantees. That’s a different standard of proof. Regulators are edging toward it – the FDA’s predetermined change control thinking, the EMA’s Annex 22 work – but edging toward something isn’t the same as having it. Until an inspector can be trained on what “good” looks like for a monitored probabilistic system, every deployment is negotiated from scratch, and you can’t scale a thing that has to be negotiated every single time.
The second is the one I actually care about, and the hardest. We need governance that can hold disagreement without collapsing it. Nearly every framework I know, the good ones included, and mine included, works by reducing a complex system to a single verdict: approved, or classified, or certified, take your pick. One number, one answer. But the systems we govern now don’t have a single answer inside them. A model can be safe for one use, and a hazard in the one next to it. It can be defensible to one stakeholder, and unaccountable to another. It can be right on average, and catastrophic in a certain use case. Force all of that into one verdict, and you haven’t governed the complexity, you’ve basically hidden it. What we don’t have yet – in standards, in regulatory science, in how we design organizations – is a way to hold several legitimate, competing assessments at once, and stay coherent without flattening or averaging them. I’ve come to see that less as a compliance problem than as an architecture problem, which is why I think it’s the one that actually decides whether the important applications ship.
Which is the thread running under all of your questions, and the thing they keep nearly asking. So let me say it plainly in the next question, since you’ve left me the room to do it.
Qx. Having answered these, what’s the one thing you most wanted to say – about governance, about trust, about what innovation looks like from inside a regulated environment – that none of the questions gave you the right opening to say?
Nuno Galante Valério: That the hardest problem in AI governance isn’t technical, and the reason the field keeps treating it as though it were is that we inherited our instincts from a generation of builders who worked in a world that behaved the same way twice.
The foundations your series has documented – the protocols, the languages, the methods – share a property so deep that it’s almost invisible: they’re deterministic. Same input, same output, every time. That isn’t incidental to how Cerf or Stroustrup think; it’s the ground they built on, and it’s a magnificent ground. It made software something you could reason about, prove things about, trust. The entire apparatus I work inside – validation, qualification, the regulated assurance of software – is downstream of that same assumption. Trust meant predictability, and predictability meant the thing had a single, stable, knowable behaviour.
The systems we’re building now don’t have that. A generative model has no single stable behaviour to validate – it has a distribution of behaviours, some excellent, some dangerous, none of them sovereign over the others. And here is what I’ve come to believe, and what I most wanted to say: this isn’t a defect we’ll engineer away. It’s the nature of the thing, and it’s the same nature that shows up the moment you look at any sufficiently complex system that has to act in the world. An organization is not a single coherent decision-maker; it’s a contest of legitimate, competing internal claims that somehow has to produce one decision. A regulatory regime is a parliament, not a person. Even a single expert under pressure is rarely one unified voice – they’re a negotiation. We have spent a century pretending these things are unitary, because unitary things are easier to hold accountable. The pretence is now breaking, because the technology we’ve built is the first one that refuses to perform the unity.
So, the governance problem I think actually matters – more than any specific standard or framework, including my own – is this: how do you make a thing trustworthy when it cannot be made to govern itself from the inside? The deterministic answer was always “constrain it until its behaviour is single and predictable.” That answer is exhausted. It doesn’t work on models, and if we’re honest, it never really worked on institutions either; we just had the luxury of pretending, and it was still mostly OK. The answer that does work is architectural. You stop trying to force the internal multiplicity into a single obedient self, and you build the external structure – provenance, verifiability, surveillance, recourse – that lets a system that is genuinely plural on the inside still be answerable on the outside. You govern the multiplicity, instead of denying it.
This is why I think people one step downstream of the technology – in the regulated trenches, where the cost of being wrong is a patient and not a metric – have something to contribute that the foundation-builders and frontier labs, for all their brilliance, are not positioned to see. They built a world that holds still. We’re learning to govern one that doesn’t. The governance of multiplicity – holding many competing, legitimate voices accountable without flattening them into one false answer – is, I’m increasingly convinced, the same problem at every scale: inside a model, inside an organization, inside a regulatory regime. Get it right in one place and you’ve learned something about all of them.
I’ll admit I didn’t arrive at that view purely from the regulatory work. It’s the kind of conviction you reach the long way around, through more than one part of a life. But the questions were generous enough to give it a professional home, and that’s the version worth putting on the record.
So, that’s the thing none of the questions asked. Thank you for the room to say it.
………………………………………………………………….

Nuno Valério is Head of Innovation for R&D Quality at Merck Healthcare in Darmstadt, where he leads AI governance for GxP-regulated pharmaceutical environments. A clinical pharmacist by training (MSc, Universidade de Coimbra), he has spent twelve years at Merck, moving from compliance into digital innovation leadership. He is the author of Trust Architecture, a seven-function framework — provenance, verifiability, accountability, reversibility, legibility, recourse, and surveillance — for making probabilistic AI systems trustworthy enough to deploy at scale. He writes the Trust Architecture newsletter and speaks regularly on what it takes to treat trust as something you engineer rather than something you simply feel.

