Page 49 - AI Standards for Global Impact: From Governance to Action
P. 49

AI Standards for Global Impact: From Governance to Action



                   and charging speed. Note that this is a mix of performance, safety, security and convenience
                   indicators.

                   We also fully expect that these indicators of quality can be, and have been, independently
                   measured – be it by regulators, or by product testing organisations. We also fully expect that   Part 2: Thematic AI
                   these indicators of quality do not just reiterate compliance with regulations (after all, we assume
                   there will not be an illegal car on offer at the dealership), but rather measure characteristics that
                   go beyond the scope of what is addressed by regulation.

                   Contrast this with procuring an AI solution like a customer service chatbot handling insurance
                   claims or an automated landing software for a professional drone. In these cases, metrics are
                   much harder to come by. Where they exist, they usually refer to low-level technical capabilities
                   of the actual AI model rather than the characteristics of the whole system or solution. In many
                   cases, they do not exist at all. Procurement departments are faced with weighing little more
                   than marketing prose when comparing multiple solutions. Even before, they are struggling to
                   write clear quality measures into requests for quotation because there are no widely agreed
                   ways of measuring quality for AI solutions.

                   There is a closely related challenge: Suppose you are in the lucky position of having an in-house
                   AI engineering team. When should this team stop testing and refining an AI system? When is
                   the system “good enough” to be deployed? In fact, what does “good” even mean, and how
                   can you measure it?

                   To construct a useful set of quality indicators for when an AI system or solution is good enough
                   or which of two AI solutions is preferable, consider that a set of quality indicators might be called
                   a “quality index”. To develop this quality index for specific use cases – comprising different
                   indicators to evaluate different context and performance aspects, and compatible with common
                   AI governance frameworks – a common shared language is required to be able to compare
                   different AI solutions, or to decide when AI is “good enough” to go live.

                   AI systems increasingly operate in high-stakes domains, there is a need for rigorous testing
                   methodologies that extend beyond conventional software validation to address ethical, fairness,
                   and safety concerns intrinsic to machine learning models. The rapid evolution of generative AI
                   and autonomous agent systems introduces complex ethical questions related to challenges
                   including model unpredictability, bias amplification, and opaque decision pathways. Key testing
                   techniques include adversarial testing for generative AI models that uncover vulnerability
                   to prompt injection and hallucination, bias auditing frameworks in agent behaviours, and
                   explainability methods. The integration of Human-in-the-Loop (HITL) frameworks was
                   emphasized as a dynamic control layer—examples include human-validated reward models in
                   reinforcement learning agents and interactive prompt refinement cycles to keep generative AI
                   outputs within ethical boundaries.

                   Industry could implement AI governance frameworks incorporating continuous monitoring
                   for model versioning and explainability, coupled with automated compliance checks based
                   on regulatory standards such as GDPR and the EU AI Act. A key focus is the integration of
                   HITL processes, which enable real-time human oversight in AI decision-making workflows to
                   correct erroneous outputs, enforce domain-specific ethical constraints, and adapt to evolving
                   operational contexts.

                   Trustworthy AI testing needs to move beyond gatekeeper-style process checks focused solely
                   on model performance. Instead, we could validate core assumptions, conceptual soundness,



                                                            37
   44   45   46   47   48   49   50   51   52   53   54