Page 49 - AI Standards for Global Impact: From Governance to Action

P. 49

AI Standards for Global Impact: From Governance to Action

and charging speed. Note that this is a mix of performance, safety, security and convenience
indicators.

We also fully expect that these indicators of quality can be, and have been, independently
measured – be it by regulators, or by product testing organisations. We also fully expect that Part 2: Thematic AI
these indicators of quality do not just reiterate compliance with regulations (after all, we assume
there will not be an illegal car on offer at the dealership), but rather measure characteristics that
go beyond the scope of what is addressed by regulation.

Contrast this with procuring an AI solution like a customer service chatbot handling insurance
claims or an automated landing software for a professional drone. In these cases, metrics are
much harder to come by. Where they exist, they usually refer to low-level technical capabilities
of the actual AI model rather than the characteristics of the whole system or solution. In many
cases, they do not exist at all. Procurement departments are faced with weighing little more
than marketing prose when comparing multiple solutions. Even before, they are struggling to
write clear quality measures into requests for quotation because there are no widely agreed
ways of measuring quality for AI solutions.

There is a closely related challenge: Suppose you are in the lucky position of having an in-house
AI engineering team. When should this team stop testing and refining an AI system? When is
the system “good enough” to be deployed? In fact, what does “good” even mean, and how
can you measure it?

To construct a useful set of quality indicators for when an AI system or solution is good enough
or which of two AI solutions is preferable, consider that a set of quality indicators might be called
a “quality index”. To develop this quality index for specific use cases – comprising different
indicators to evaluate different context and performance aspects, and compatible with common
AI governance frameworks – a common shared language is required to be able to compare
different AI solutions, or to decide when AI is “good enough” to go live.

AI systems increasingly operate in high-stakes domains, there is a need for rigorous testing
methodologies that extend beyond conventional software validation to address ethical, fairness,
and safety concerns intrinsic to machine learning models. The rapid evolution of generative AI
and autonomous agent systems introduces complex ethical questions related to challenges
including model unpredictability, bias amplification, and opaque decision pathways. Key testing
techniques include adversarial testing for generative AI models that uncover vulnerability
to prompt injection and hallucination, bias auditing frameworks in agent behaviours, and
explainability methods. The integration of Human-in-the-Loop (HITL) frameworks was
emphasized as a dynamic control layer—examples include human-validated reward models in
reinforcement learning agents and interactive prompt refinement cycles to keep generative AI
outputs within ethical boundaries.

Industry could implement AI governance frameworks incorporating continuous monitoring
for model versioning and explainability, coupled with automated compliance checks based
on regulatory standards such as GDPR and the EU AI Act. A key focus is the integration of
HITL processes, which enable real-time human oversight in AI decision-making workflows to
correct erroneous outputs, enforce domain-specific ethical constraints, and adapt to evolving
operational contexts.

Trustworthy AI testing needs to move beyond gatekeeper-style process checks focused solely
on model performance. Instead, we could validate core assumptions, conceptual soundness,

44 45 46 47 48 49 50 51 52 53 54