The Annual AI Governance Report 2025: Steering the Future of AI



                      Theme 1: The Year of AI Agents


                  Since the end of 2024, the field of AI has grown to encompass not only chat-style tools but
                  also so-called AI agents. Whereas earlier systems waited for a prompt, the new wave couples
                  large-language-model reasoning with tool-use scaffolds, letting software plan, decide and act
                  across multi-step workflows – booking trips, debugging code, even negotiating purchases – with
                  minimal human supervision. The rapid capability gains witnessed in what many now call 'the
                  year of AI agents' raise new questions about governance and safe deployment.


                  1.1  Rapid Capability Improvements

                  Current AI agent capabilities. Since late 2024, leading labs have begun shipping production-
                  ready AI agents that can operate a full desktop or browser on the user’s behalf. These agents
                  have evolved from simple task executors into systems capable of autonomous decision-making
                  and multi-step reasoning with minimal human oversight. OpenAI’s Operator preview books
                  holidays, fills out forms and completes online purchases end-to-end, while Salesforce’s
                  AgentForce platform promises “a billion enterprise agents” by 2026. 2 These systems couple
                  large-language-model “brains” with tool-use scaffolding, letting them read screens, click buttons
                  and call APIs – a qualitative jump from passive copilots to autonomous digital workers.

                  Performance benchmarks. Progress has been impressive but uneven. On a benchmark built
                  from real GitHub bugs, the best agent’s success rate rose from fixing roughly one third of the
                  problems to just over half; yet on bugs that would take a human coder an hour or more to
                  resolve, agents almost never find the solution. 3 The same pattern repeats outside coding.
                  In household robot tests, agents complete only about one job in seven, whereas people
                  succeed almost every time. In broader autonomy suites, where each task runs for an hour or
                  longer, agents succeed fewer than one time in five, and in a demanding web navigation test
                  they finish only about one mission in seven. The upside is that the longest task an agent can
                  complete is growing rapidly – roughly doubling every seven months – so capabilities are
                  improving even where performance remains inconsistent. 4
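The doubling trend above can be made concrete with a back-of-the-envelope extrapolation. The sketch below is illustrative only: the seven-month doubling time comes from the cited research, but the starting task horizon and the function name are assumptions, not figures from this report.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Longest completable task length after `months_ahead` months,
    assuming the observed ~7-month doubling trend simply continues."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# If agents could reliably finish 1-hour tasks today, the trend would imply:
for months in (7, 14, 21, 28):
    print(f"{months} months out: ~{projected_horizon(1.0, months):.0f} h tasks")
```

Such exponential extrapolations are sensitive to the assumed doubling time and can break down, which is one reason the report pairs this trend with a call for reproducible evaluations.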
                  Policy considerations. This steep capability curve gives policymakers a narrow window to act.
                  Researchers have shown that leaderboard scores can be inflated simply by spending more
                  compute at inference time or by sampling many candidate answers, rather than by making the
                  underlying model more capable. This can hide the true costs and weaknesses of these systems,
                  and today’s benchmarks rarely track factors such as energy use, bias or labour displacement.
                  Forward-looking governance structures are needed that pair technical scores with societal
                  metrics – safety, fairness, ecological footprint and economic impact – while requiring cost-
                  controlled, reproducible evaluations. Forecasts suggest human-level performance on key
                  autonomy suites could arrive before 2027, so establishing guard-rails now is prudent. 5
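The score-inflation point has a simple probabilistic core: if a system solves a task with probability p on a single attempt, letting it try k times and keeping any success lifts the reported rate to 1 − (1 − p)^k. The numbers below are hypothetical and only illustrate the mechanism.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts,
    each succeeding with probability p."""
    return 1 - (1 - p) ** k

# A hypothetical system solving 20% of tasks per attempt looks far
# stronger when a leaderboard accepts the best of ten tries:
print(round(pass_at_k(0.2, 1), 2))   # 0.2
print(round(pass_at_k(0.2, 10), 2))  # 0.89
```

This is why the report argues for cost-controlled evaluations: reporting the attempt budget alongside the score makes brute-force gains visible instead of hiding them.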







                  2   Kraprayoon, J. (2025, April 17). AI Agent Governance: A Field Guide. Institute for AI Policy and Strategy.
                  3   Kraprayoon, J. (2025, April 17), Fn. 1
                  4   Kraprayoon, J. (2025, April 17), Fn. 1
                  5   Gabriel, I., Manzini, A., Keeling, G., Hendricks, L. A., Rieser, V., Iqbal, H., Tomašev, N., Ktena, I., Kenton, Z.,
                     Rodriguez, M., El-Sayed, S., Brown, S., Akbulut, C., Trask, A., Hughes, E., Bergman, A. S., Shelby, R., Marchal,
                     N., Griffin, C., Manyika, J. (2024, April 24). The ethics of advanced AI assistants. arXiv.org.


