The Annual AI Governance Report 2025: Steering the Future of AI
Theme 1: The Year of AI Agents
Since the end of 2024, the field of AI has grown to encompass not only chat-style tools but also so-called AI agents. Whereas earlier systems waited for a prompt, the new wave couples large-language-model reasoning with tool-use scaffolds, letting software plan, decide, and act across multi-step workflows – booking trips, debugging code, even negotiating purchases – with minimal human supervision. The rapid improvement in capabilities witnessed in what many now call 'the year of AI agents' raises new questions about governance and safe deployment.
1.1 Rapid Capability Improvements
Current AI agent capabilities. Since late 2024, leading labs have begun shipping production-ready AI agents that can operate a full desktop or browser on the user's behalf. These agents have evolved from simple task executors into complex systems capable of autonomous decision-making and multi-step reasoning with minimal human oversight. OpenAI's Operator preview books holidays, fills out forms and completes online purchases end-to-end, while Salesforce's AgentForce platform promises "a billion enterprise agents" by 2026.² These systems couple large-language-model "brains" with tool-use scaffolding, letting them read screens, click buttons, and call APIs – marking a qualitative jump from passive copilots to autonomous digital workers.
Performance benchmarks. Progress has been impressive, but it is still patchy. On a benchmark built from real GitHub bugs, the best agents' success rate rose from fixing roughly one-third of the problems to just over half. Yet on bugs difficult enough that a human coder would need an hour or more to resolve, agents almost never find the solution.³
The same pattern repeats outside coding. In household robot tests, agents complete only about one job in seven, whereas people succeed almost every time. In broader autonomy suites, where each task runs for an hour or longer, agents succeed fewer than one time in five, and in a tough web-navigation test they finish only about one task in seven. The upside is that the longest task an agent can complete is growing rapidly – roughly doubling every seven months – so their capabilities are improving, even if their performance is still inconsistent.⁴
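The doubling trend can be sketched as simple exponential growth. The starting horizon and time span below are illustrative assumptions, not figures from the report; only the seven-month doubling period comes from the cited research.

```python
def projected_horizon(h0_minutes: float, months: float,
                      doubling_months: float = 7.0) -> float:
    """Task-length horizon after `months`, if the longest completable
    task doubles every `doubling_months` months."""
    return h0_minutes * 2 ** (months / doubling_months)


# Assumption for illustration: an agent that can currently complete
# 1-hour tasks. After 21 months (three doublings) the horizon would be:
print(projected_horizon(60, 21))  # 480.0 minutes, i.e. 8 hours
```

Even under this simple model, a horizon measured in minutes today becomes a working day within a couple of years, which is why the report treats the trend as a governance deadline rather than a curiosity.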
Policy considerations. This steep capability curve gives policymakers a narrow window to act. Researchers have shown that leaderboard scores can be boosted simply by spending more computing power or by trying many candidate answers, rather than by actually making the AI smarter. This can hide the true costs and weaknesses of the systems; today's benchmarks rarely track factors such as energy use, bias or labour displacement. Forward-looking governance structures are needed that pair technical scores with societal metrics – safety, fairness, ecological footprint and economic impact – while requiring cost-controlled, reproducible evaluations. Forecasts suggest human-level performance on key autonomy suites could arrive before 2027, so establishing guard-rails now is prudent.⁵
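The score-inflation effect of "trying lots of options" can be illustrated with the standard pass@k identity for independent attempts. The per-attempt success rate below is an illustrative assumption, not a figure from any benchmark cited here.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Repeated sampling (best-of-n) raises a leaderboard score without
    changing the underlying per-attempt capability p_single.
    """
    return 1 - (1 - p_single) ** k


# Assumption for illustration: an agent that solves a task 20% of the
# time per attempt. Ten retries make it look far more capable:
print(round(pass_at_k(0.2, 1), 3))   # 0.2
print(round(pass_at_k(0.2, 10), 3))  # 0.893
```

This is why cost-controlled evaluations matter: a score reported without the compute or attempt budget behind it conflates genuine capability with brute-force retrying.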
2 Kraprayoon, J. (2025, April 17). AI Agent Governance: A Field Guide. Institute for AI Policy and Strategy.
3 Kraprayoon, J. (2025, April 17), Fn. 1
4 Kraprayoon, J. (2025, April 17), Fn. 1
5 Gabriel, I., Manzini, A., Keeling, G., Hendricks, L. A., Rieser, V., Iqbal, H., Tomašev, N., Ktena, I., Kenton, Z.,
Rodriguez, M., El-Sayed, S., Brown, S., Akbulut, C., Trask, A., Hughes, E., Bergman, A. S., Shelby, R., Marchal,
N., Griffin, C., Manyika, J. (2024, April 24). The ethics of advanced AI assistants. arXiv.org.