Home

I acknowledge the traditional custodians of the lands on which I work, the Gadigal people of the Eora Nation and the Darkinjung people of the Central Coast. I pay respect to Elders past and present, and recognise this land as the site of the oldest continuing knowledge systems in the world.


I help organisations understand, evaluate and govern generative and agentic AI systems across prompts, tools, workflows, human judgement and institutional accountability.

PhD in AI Evaluation & Ethics | Former Google Research, Ethical AI | Published AI & Ethics author | Lead Guest Editor, AI Agents: Ethics, Safety and Governance

I work at the intersection of AI evaluation, responsible AI, governance and public communication. My work helps organisations move beyond generic AI principles toward practical systems for testing, accountability, oversight and assurance. Most AI evaluation still measures outputs. I focus on how systems behave in deployment.

AI evaluation
I design methods to assess how AI systems behave in real-world conditions, not just benchmark tasks or isolated outputs. This includes testing prompt sensitivity, workflow behaviour, and how systems perform across changing contexts.
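To make this concrete, here is a minimal sketch of a prompt-sensitivity check: the same task phrased several ways, with outputs compared for stability. Everything in it is illustrative; call_model is a toy stand-in for whatever deployed system is actually under test, and simple string similarity stands in for a proper semantic comparison.

    # Sketch: does the system give materially different answers when
    # the same request is phrased differently?
    from difflib import SequenceMatcher

    def call_model(prompt: str) -> str:
        # Toy stand-in for the deployed system under evaluation.
        return "Summary: " + prompt.lower()

    PARAPHRASES = [
        "Summarise this policy for a general audience.",
        "Explain this policy in plain language.",
        "Give a short, accessible summary of this policy.",
    ]

    def prompt_sensitivity(prompts: list[str]) -> float:
        # Mean pairwise similarity of outputs; lower means the system
        # is more sensitive to phrasing.
        outputs = [call_model(p) for p in prompts]
        scores = [SequenceMatcher(None, a, b).ratio()
                  for i, a in enumerate(outputs)
                  for b in outputs[i + 1:]]
        return sum(scores) / len(scores)

    print(prompt_sensitivity(PARAPHRASES))

The choice of paraphrase set, similarity metric and acceptance threshold are themselves evaluation design decisions, which is part of the point.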

AI governance
I map where responsibility, judgement and accountability move when AI enters organisational workflows. This includes prompts, tools, retrieval, memory, interface design, and evaluation metrics as governance layers.

Agentic AI risk
I help organisations prepare for systems that retrieve, route, act, remember and trigger downstream consequences. This includes evaluating behaviour over time, not just outputs, and designing oversight, logging and intervention points.
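As a rough illustration of logging and intervention points, the sketch below wraps each agent tool call in an audit record and pauses high-impact actions for human approval. The tool names, log format and approval rule are assumptions for illustration, not a prescribed design.

    # Sketch: every tool call is logged; high-impact actions require
    # an explicit human decision before they run.
    import json, time

    AUDIT_LOG = "agent_audit.jsonl"
    HIGH_IMPACT = {"send_email", "execute_payment", "delete_record"}

    def log_event(event: dict) -> None:
        event["ts"] = time.time()
        with open(AUDIT_LOG, "a") as f:
            f.write(json.dumps(event) + "\n")

    def run_tool(name: str, args: dict, approve) -> str:
        # `approve` is the human intervention point: a callable that
        # returns True only if a person has signed off on the action.
        log_event({"stage": "requested", "tool": name, "args": args})
        if name in HIGH_IMPACT and not approve(name, args):
            log_event({"stage": "blocked", "tool": name})
            return "blocked: awaiting human approval"
        result = f"executed {name}"  # stand-in for the real tool call
        log_event({"stage": "completed", "tool": name})
        return result

    print(run_tool("send_email", {"to": "someone@example.com"},
                   approve=lambda n, a: False))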

Most AI governance still focuses on models and outputs. In practice, risk and responsibility emerge through how systems are configured, used and evaluated. These core ideas provide a working framework for understanding AI as a dynamic, sociotechnical system rather than a static tool.

Evaluation is governance
What we choose to evaluate becomes what organisations optimise. Evaluation does not just measure systems. It shapes what becomes visible, rewarded and legitimate in practice.

The model is not the governance object
The deployed configuration is: model, prompts, tools, memory, retrieval, interface, workflow and oversight. The same model can create very different risks depending on how it is configured and used.
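One way to make this concrete is to treat the whole configuration as a single, versioned object. The sketch below is illustrative only: the fields, the model name gpt-x and the fingerprinting scheme are assumptions, but it shows why two deployments of the same model are distinct governance objects.

    # Sketch: fingerprint the deployed configuration, not the model,
    # so audits track the thing that actually determines behaviour.
    import hashlib, json
    from dataclasses import dataclass, asdict

    @dataclass(frozen=True)
    class DeployedConfig:
        model: str
        system_prompt: str
        tools: tuple[str, ...]
        retrieval_source: str
        memory_enabled: bool
        oversight: str  # e.g. "post-hoc review", "human-in-the-loop"

        def fingerprint(self) -> str:
            blob = json.dumps(asdict(self), sort_keys=True)
            return hashlib.sha256(blob.encode()).hexdigest()[:12]

    a = DeployedConfig("gpt-x", "You are a support agent.",
                       ("search",), "kb-v1", False, "post-hoc review")
    b = DeployedConfig("gpt-x", "You are a support agent.",
                       ("search", "refund"), "kb-v2", True,
                       "human-in-the-loop")
    assert a.fingerprint() != b.fingerprint()  # same model, different object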

AI does not remove judgement
It relocates it into prompts, tools, defaults, metrics and institutional processes. The governance question is whether those relocated judgements are visible, contestable and accountable.

Outputs are residues
Outputs are the visible endpoints of path-dependent processes. To govern AI well, we need to understand the trajectory that produced them, not just the final answer.

These ideas shift AI governance from static outputs to dynamic systems. The focus moves to configuration, trajectory and evaluation as active governance layers, where behaviour is shaped, responsibility is distributed, and accountability must be designed rather than assumed.

Effective AI governance depends on how systems are evaluated. What organisations choose to measure shapes what gets optimised, surfaced and treated as acceptable behaviour.

In real deployments, the model itself is only one part of the system. Risk and accountability sit across the full configuration, including prompts, tools, retrieval, memory, interfaces, workflows and human oversight. The same underlying model can produce very different outcomes depending on how it is implemented.

AI does not eliminate human judgement. It redistributes it across system design choices such as prompts, defaults, metrics and operational processes. Governance therefore requires making those decision points visible and accountable.

Finally, outputs alone are not sufficient evidence. They are the end result of path-dependent processes. Understanding how a system arrives at an outcome, including its trajectory through prompts, tools and context, is critical for robust evaluation and oversight.
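A minimal sketch of what recording a trajectory can mean, with wholly illustrative step types and data:

    # Sketch: keep the path that produced an output, not just the output.
    from dataclasses import dataclass, field

    @dataclass
    class Step:
        kind: str     # e.g. "prompt", "retrieval", "tool", "output"
        payload: str

    @dataclass
    class Trajectory:
        steps: list[Step] = field(default_factory=list)

        def record(self, kind: str, payload: str) -> None:
            self.steps.append(Step(kind, payload))

    t = Trajectory()
    t.record("prompt", "Summarise the customer's complaint.")
    t.record("retrieval", "prior correspondence on the account")
    t.record("tool", "sentiment_classifier -> negative")
    t.record("output", "Customer is dissatisfied with delivery delays.")

With the trajectory in hand, an evaluator can ask how the answer was produced, not only whether the final text looks acceptable.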

My PhD thesis, Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems, is now available on arXiv.

The thesis argues that AI evaluation is not neutral measurement. It shapes what models appear to be, what organisations optimise, and whose values become visible in practice.

Read the thesis: https://arxiv.org/abs/2604.20545

Dr Rebecca L. Johnson is an AI evaluation and governance expert whose work examines how generative and agentic AI systems behave in real-world sociotechnical contexts. She holds a PhD in AI Evaluation and Ethics from the University of Sydney and was formerly a researcher in Google Research’s Ethical AI team. Her work spans AI evaluation, pluralist benchmarking, cultural value drift, agentic AI governance, public policy and responsible AI practice.

See the Researcher page for a CV snapshot, media bio, social media links and contact details. My blog offers short, accessible takes on AI ethics.

I provide expert briefings, workshops and advisory support for organisations navigating generative AI governance, evaluation and accountability.

  • AI governance briefings
  • Agentic AI risk and accountability workshops
  • AI evaluation and assurance design
  • Responsible AI strategy
  • Public talks, panels and media commentary

For speaking, advisory or workshop enquiries, contact me here.

Continue exploring: Research | Consulting | Media | Publications | Connect