I acknowledge and pay respect to the traditional owners of the land on which I work, the Gadigal people of the Eora Nation and the Darkinjung people of the Central Coast. This land has always been a learning space and is the site of the oldest continuing culture and knowledge system in the world.
AI evaluation and governance for systems that act in the real world

I help organisations understand, evaluate and govern generative and agentic AI systems across prompts, tools, workflows, human judgement and institutional accountability.
PhD in AI Evaluation & Ethics | Former Google Research, Ethical AI | Published AI & Ethics author | Lead Guest Editor, AI Agents: Ethics, Safety and Governance
What I do
I work at the intersection of AI evaluation, responsible AI, governance and public communication. My work helps organisations move beyond generic AI principles toward practical systems for testing, accountability, oversight and assurance. Most AI evaluation still measures outputs. I focus on how systems behave in deployment.

AI evaluation
I design methods to assess how AI systems behave in real-world conditions, not just benchmark tasks or isolated outputs. This includes testing prompt sensitivity, workflow behaviour, and how systems perform across changing contexts.
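
A minimal sketch of what prompt-sensitivity testing can look like in code (illustrative only: call_model is a placeholder for whatever LLM client is in use, and the rebate prompts are invented):

```python
from collections import Counter

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM client call. Returns a fixed
    # answer here so the sketch runs end to end.
    return "yes"

def prompt_sensitivity(paraphrases: list[str], runs: int = 3) -> float:
    """Fraction of responses that disagree with the modal answer.

    0.0 means the system answers identically however the task is
    phrased; values near 1.0 indicate high prompt sensitivity.
    """
    answers = [call_model(p) for p in paraphrases for _ in range(runs)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - modal_count / len(answers)

# The same eligibility question phrased three ways.
paraphrases = [
    "Is this applicant eligible for the rebate? Answer yes or no.",
    "Answer yes or no: does this applicant qualify for the rebate?",
    "Rebate eligibility for this applicant: yes or no?",
]
print(f"sensitivity: {prompt_sensitivity(paraphrases):.2f}")
```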
AI governance
I map where responsibility, judgement and accountability move when AI enters organisational workflows. This includes prompts, tools, retrieval, memory, interface design, and evaluation metrics as governance layers.
Agentic AI risk
I help organisations prepare for systems that retrieve, route, act, remember and trigger downstream consequences. This includes evaluating behaviour over time, not just outputs, and designing oversight, logging and intervention points.
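
A minimal sketch of logging and intervention points in practice (the tool names, approval rule and log format are all illustrative assumptions, not a prescribed design): every agent action passes through a wrapper that records an audit trail and blocks high-risk actions until a human signs off.

```python
import json
import time

AUDIT_LOG = "agent_audit.jsonl"
REQUIRES_APPROVAL = {"send_email", "update_record"}  # hypothetical tool names

def run_tool(tool_name: str, args: dict, approved_by: str | None = None) -> dict:
    """Gate and log a single agent action.

    High-risk tools are blocked unless a named human has approved them;
    every attempt, allowed or not, is appended to the audit trail.
    """
    allowed = tool_name not in REQUIRES_APPROVAL or approved_by is not None
    entry = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "approved_by": approved_by,
        "allowed": allowed,
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
    if not allowed:
        return {"status": "blocked", "reason": "human approval required"}
    # Dispatch to the real tool implementation here.
    return {"status": "ok"}

# A low-risk lookup runs; a consequential action waits for sign-off.
run_tool("search_kb", {"query": "refund policy"})
run_tool("send_email", {"to": "customer@example.com"}, approved_by=None)
```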
Core Ideas
Most AI governance still focuses on models and outputs. In practice, risk and responsibility emerge through how systems are configured, used and evaluated. These core ideas provide a working framework for understanding AI as a dynamic, sociotechnical system rather than a static tool.

These ideas shift AI governance from static outputs to dynamic systems. The focus moves to configuration, trajectory and evaluation as active governance layers, where behaviour is shaped, responsibility is distributed, and accountability must be designed rather than assumed.
Evaluation and governance of AI systems in practice
Effective AI governance depends on how systems are evaluated. What organisations choose to measure shapes what gets optimised, surfaced and treated as acceptable behaviour.
In real deployments, the model itself is only one part of the system. Risk and accountability sit across the full configuration, including prompts, tools, retrieval, memory, interfaces, workflows and human oversight. The same underlying model can produce very different outcomes depending on how it is implemented.
AI does not eliminate human judgement. It redistributes it across system design choices such as prompts, defaults, metrics and operational processes. Governance therefore requires making those decision points visible and accountable.
Finally, outputs alone are not sufficient evidence. They are the end result of path-dependent processes. Understanding how a system arrives at an outcome, including its trajectory through prompts, tools and context, is critical for robust evaluation and oversight.
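
A minimal sketch of recording that trajectory as structured evidence (the step kinds and field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    kind: str    # "prompt", "retrieval", "tool_call", "output"
    detail: str  # what was sent, fetched, invoked, or produced

@dataclass
class Trace:
    """A replayable record of how the system reached its output."""
    steps: list[TraceStep] = field(default_factory=list)

    def log(self, kind: str, detail: str) -> None:
        self.steps.append(TraceStep(kind, detail))

# The trajectory, not just the final line, is what gets reviewed.
trace = Trace()
trace.log("prompt", "Summarise the claimant's history.")
trace.log("retrieval", "3 documents fetched from case archive")
trace.log("tool_call", "risk_score(claimant_id=...)")
trace.log("output", "Summary flagged as high risk.")
```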
About
Dr Rebecca L. Johnson is an AI evaluation and governance expert whose work examines how generative and agentic AI systems behave in real-world sociotechnical contexts. She holds a PhD in AI Evaluation and Ethics from the University of Sydney and was formerly a researcher in Google Research’s Ethical AI team. Her work spans AI evaluation, pluralist benchmarking, cultural value drift, agentic AI governance, public policy and responsible AI practice.
See the Media coverage and talks page for recent appearances.
The Researcher page has a CV snapshot, media bio, social media links, and contact details. My blog offers accessible, easily digestible takes on AI ethics.
Advisory, speaking and workshops
I provide expert briefings, workshops and advisory support for organisations navigating generative AI governance, evaluation and accountability.
- AI governance briefings
- Agentic AI risk and accountability workshops
- AI evaluation and assurance design
- Responsible AI strategy
- Public talks, panels and media commentary
For speaking, advisory or workshop enquiries, contact me here
Overview of research
Generative AI evaluation is not neutral measurement. It helps shape what AI systems appear to be, what organisations optimise, and whose values become visible in practice.
My research examines generative and agentic AI as sociotechnical systems: systems whose behaviour emerges through interaction between models, prompts, users, institutions, datasets, tools and evaluation methods. Rather than treating outputs as simple evidence of model capability, I ask how those outputs are produced, interpreted and amplified across real contexts.
My PhD, Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems, develops a descriptive and pluralist approach to AI evaluation. It brings together measurement theory, enactivism and moral value pluralism to show how evaluation can reveal, rather than erase, cultural and value diversity.
Key contributions include MaSH Loops, a framework for understanding recursive Machine-Society-Human feedback, and the World Values Benchmark, a method for comparing generative AI behaviour against plural human value distributions.
The central question across my work is:
How can AI evaluation become a transparent, accountable and pluralist part of governance, rather than a hidden mechanism for reproducing narrow assumptions?
Quick links for this page
- Enactivism as a Way of Knowing AI
- MaSH Loops (Machine-Society-Human)
- The Model is Not the Territory
- Descriptive Evaluation Calibrated to Human Data
- Participatory Realism and Quantum Measurement
- Evaluation as Governance
1 • Enactivism as a Way of Knowing AI
Reframing how we understand what AI systems reflect and enact through participation with us
Enactivism offers a more human way of understanding AI. It sees knowledge and meaning not as things stored inside a system, but as something created through interaction. What we learn from AI depends on how we engage with it.

This illustration expresses the enactivist view that knowledge and meaning are not passively received or stored, but actively brought forth through embodied engagement with the world. Mind and environment continually shape one another through dynamic loops of perception and action.
This approach builds on earlier ways of thinking—like functionalism, which looks at how systems work, and constructivism, which sees knowledge as socially shaped—but adds the dimension of lived participation. It reminds us that meaning arises not just from what machines do or represent, but from how people and systems act together. From this view, evaluation is not only about what a model contains, but about what it brings into being through its interaction with us.
“Functionalism privileges efficiency and performance; constructivism uncovers context and bias; enactivism asks how systems participate in meaning.”
— PhD thesis: Ch. 1 — Epistemological Rumbles in Responsible AI
2 • MaSH Loops (Machine-Society-Human)
Mapping feedback and co-construction across sociotechnical systems.
MaSH Loops is a way of studying how machines, societies, and people continually shape one another. It treats generative AI not as a stand-alone tool or predictor, but as part of a living system where design choices, data, user practices, and institutional norms feed back into one another. These loops make visible where values enter—and how they are enacted in return.

Meaning and value arise in the spaces where humans, machines, and societies interact and co-create.
The framework builds on the spirit of cybernetics and constructivism, while extending both through enactivism’s focus on participation. Functionalism reminds us that systems have structure; constructivism shows that structure is socially shaped; MaSH Loops brings them together through interaction, mapping how meaning circulates through the recursive ties of design, deployment, and interpretation.
“MaSH Loops—Machine, Society, Human—trace how models, people, and institutions recursively co-construct meaning and values.”
— PhD thesis: Ch. 2 — The Ghost in the Machine Has an American Accent
3 • The Model is Not the Territory
Pedagogies for seeing how models make worlds.
Teaching Responsible AI is about more than technical skill or ethical checklists. It’s about helping people understand that every model simplifies and frames the world in its own way. Whether it’s a neural network or a policy diagram, each model highlights some things and leaves others out—choices that shape what we notice, value, and act on.

Like maps, AI models highlight some features and omit others, shaping how we see and understand the world.
My approach uses sociotechnical mapping, a method I developed to visualise the feedback loops between people, data, institutions, and machines. By mapping these relationships, learners can spot where assumptions enter, whose perspectives are missing, and how those choices affect outcomes. The process doubles as a validity check, asking whether our models truly capture what matters or simply mirror existing biases.

Adapted from sociotechnical systems theory, this framework illustrates how the evaluation of language models emerges from interactions between technical components and social contexts. Benchmark schemas, prompts, datasets, and metrics are shaped by underlying values and assumptions within broader social systems.
Through real-world case studies, this approach turns complex theory into lived insight. Students learn to see modelling as an interpretive act and to use mapping as both an analytical and ethical practice for designing and evaluating AI systems.
“The map is not the territory—but our maps decide which parts of the territory matter.”
— PhD thesis: Ch. 3 — The Model is Not the Market
4 • Descriptive Evaluation Calibrated to Human Data
Developing new methods for measuring what AI enacts
Traditional AI benchmarks test how well models perform against fixed targets such as accuracy, bias, or toxicity scores. These measures can be useful but often reveal more about the designers’ assumptions than about what a system actually reflects.
My research develops a new approach called descriptive evaluation. Instead of scoring AI on preset standards, it compares model responses with human survey data to see which cultural or national value patterns they most resemble. This helps us understand what values AI systems reflect, rather than assuming what they should.

The World Values Benchmark (WVB) links human data from the World Values Survey with AI model outputs to show how systems reflect global value patterns and how evaluation choices shape what AI appears to value.
The method builds on principles from measurement theory. It uses carefully designed prompt sets to reduce noise, balanced anchors to prevent framing bias, and Bayesian corrections to adjust for model preferences. The results produce distributional cultural profiles that are stable, interpretable, and comparable across contexts.
This work comes together in the WVB, which links AI outputs with data from the World Values Survey. The framework shifts evaluation from prescriptive scoring toward descriptive comparison—making model behaviour more transparent, plural, and open to contestation.
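
As a toy illustration of the descriptive step (the numbers are invented, and the real WVB additionally involves the prompt design, anchoring and Bayesian corrections described above), a model's response distribution on a survey item can be compared with national distributions using Jensen-Shannon distance:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Response shares on one survey item (options: agree / neutral / disagree).
# Human rows would come from World Values Survey data; all numbers here
# are invented for illustration.
human = {
    "Country A": np.array([0.60, 0.25, 0.15]),
    "Country B": np.array([0.20, 0.30, 0.50]),
}
model = np.array([0.55, 0.30, 0.15])  # model's distribution over the options

# Base-2 Jensen-Shannon distance: 0 = identical, 1 = maximally different.
for country, dist in human.items():
    print(f"{country}: JS distance = {jensenshannon(model, dist, base=2):.3f}")
```

On these invented numbers the model's profile sits far closer to Country A than Country B: a descriptive comparison of resemblance, not a score against a preset standard.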
“Evaluation should be descriptive, pluralist, and enactivist—it should reveal assumptions rather than conceal them.”
— PhD thesis: Ch. 4 — The World Values Benchmark
5 • Participatory Realism and Quantum Measurement
Understanding how observation and evaluation co-create meaning.
This area extends my work from systems and methods to the question of observation itself. Participatory realism builds on enactivism’s insight that knowing happens through interaction and adds a further idea: measurement is participatory. In both quantum physics and generative AI, observation does not simply reveal a pre-existing state—it helps bring one into being.

Just as observing a photon changes its interference pattern in the double-slit experiment, prompting an AI helps shape the patterns it produces. In both cases, outcomes are not simply discovered—they are brought into being through interaction.
Generative models can be thought of as vast fields of potential meaning. A prompt acts like a measurement, turning possibility into a specific result. Each output reflects not only the model’s design and data but also the human questions, cultural assumptions, and interpretive context that shape the exchange.
Seen this way, evaluation becomes a kind of measurement: a meeting point between human intention and machine probability. Just as physics shows that the observer cannot stand outside the system, Responsible AI must recognise that our evaluations help shape what AI becomes.

Adapted from Bell’s theorem, this diagram contrasts two views of meaning in generative AI. The top path assumes fixed values that can be retrieved. The lower path reflects the enactivist view: meaning arises only through interaction, as each prompt collapses a field of possibilities into a single outcome.
Just as quantum measurement resolves potential into actuality, evaluation in generative AI selects from a range of possible meanings. There are no hidden variables determining an outcome in advance; each prompt is an experiment that helps define the system it probes. In this sense, responsible evaluation is less about revealing what a model is than about observing what emerges when human intention and machine probability meet.
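
A toy sampler makes the metaphor concrete (a sketch with invented probabilities; real models operate over vast vocabularies): before prompting, the system is a distribution over possible outputs, and each generation resolves it into one observed outcome.

```python
import random

# Invented next-word probabilities for one prompt: the model's "field of
# potential meaning" before any output is observed.
completions = {"stability": 0.40, "freedom": 0.35, "tradition": 0.25}

def measure(dist: dict[str, float]) -> str:
    """One generation = one draw: possibility resolved into an outcome."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Repeating the "measurement" recovers a distribution, never a single
# hidden answer; what we observe depends on how, and how often, we ask.
samples = [measure(completions) for _ in range(1000)]
print({word: samples.count(word) / 1000 for word in completions})
```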
“Evaluation, like observation in quantum mechanics, is a participatory act that helps bring outcomes into being.”
— PhD thesis: Ch. 5 — Semantic Auroras
6 • Evaluation as Governance
Designing measures that shape accountability
The Coda of my thesis argues that evaluation is not a peripheral task but a central mechanism through which AI systems are shaped and governed. Every benchmark, dataset, and metric carries normative assumptions that influence what becomes visible, comparable, and optimisable. Evaluation, therefore, functions as an instrument of governance; it determines how capability, alignment, and responsibility are defined in practice.

This diagram shows how evaluation operates as a governance system. Each stage—from designing benchmarks to reflecting on whose values are measured—forms a feedback loop linking machines, societies, and humans. The four orders of cybernetics capture escalating layers of accountability: behaviour, thinking, shared perception, and self-observation.
My ongoing research extends this insight toward the design of evaluative infrastructures: frameworks that integrate descriptive, pluralist, and enactivist approaches into policy and institutional processes. By treating measurement as part of governance design, we can make explicit whose values are being reinforced, where accountability resides, and how evaluation criteria evolve alongside the systems they assess.
“What we choose to measure determines what AI becomes in practice.”
— PhD thesis: Coda — Measuring What We Enact
“Evaluation is not a side activity—it is how we come to know ourselves in relation to the machines we make.”
— PhD thesis: Coda — Measuring What We Enact
Collaboration across disciplines
The ethics of generative AI cannot be built in isolation. Progress depends on dialogue between computer scientists, philosophers, social scientists, and practitioners who study how technology shapes society. Respecting the depth of existing philosophical and humanities work, and combining it with technical insight, gives us the best chance of steering these systems responsibly.

Building ethical AI is not just a technical challenge but a collective practice of reflection, design, and responsibility. Collaboration across disciplines ensures that the systems we build remain accountable to the diverse human contexts in which they operate.


