Deutsches Institut für Normung (DIN) began drafting an “AI roadmap” in May 2019 “to create a framework for action for standardization” [42]. DIN has also founded an interdisciplinary AI Working Committee [43] and is working on two DIN SPECs related to AI [44, 45].

Large companies lead the field in AI and have started joint activities on safe AI, which may quickly establish de facto standards. The “Partnership on Artificial Intelligence to Benefit People and Society” is led by representatives of large technology firms and several other member organizations, including from academia and civil society. The first goal of this initiative is “to develop and share best-practice methods and approaches in the research, development, testing, and fielding of AI technologies”. This includes addressing “the trustworthiness, reliability, containment, safety, and robustness of the technology”. The partnership is particularly interested in “safety-critical application areas” and mentions healthcare as an example [46].

The “OpenAI” research center, which is well known in the ML/AI research community and backed by large investors, has recently published a policy paper on “the role of cooperation in responsible AI development” “across organizational and national borders”, discussing “joint research into the formal verification of AI systems’ capabilities and other aspects of AI safety”. In particular, they mention “various applied ‘AI for good’ projects whose results might have wide ranging and largely positive applications (e.g. in domains like [...] health); coordinating on the use of particular benchmarks; joint creation and sharing of datasets that aid in safety research”. Moreover, they raise the question of the role of “standardization bodies in resolving collective action problems between companies”, in particular internationally [47]. OpenAI claims that “AI companies can work to develop industry norms and standards that ensure systems are developed and released only if they are safe, and can agree to invest resources in safety during development and meet appropriate standards prior to release”. They “anticipate that identifying similar mechanisms to improve cooperation on AI safety between states and with other non-industry actors will be of increasing importance in the years to come” [48].

3. VALIDATING DIGITAL HEALTH TECHNOLOGIES

Previous work can provide orientation for future international standards for the validation of novel ML/AI-based health technologies. Physicians, regulators, scientists and engineers have long-standing experience in dealing with complex safety-critical health interventions and technologies that require careful validation checks prior to usage. These technologies include, for instance, clinical interventions, surgical procedures, pharmaceutics, medical devices and software. Randomized controlled clinical trials, peer review of scientific literature and standard tests in accredited testing laboratories are examples of well-established methods for assessing these interventions, substances or devices.

Typically, AI serves as a multivariable prediction model that maps multidimensional input variables to one- or multidimensional output variables, e.g. pictures to disease classification codes. Accordingly, the TRIPOD statement for the “transparent reporting of a multivariable prediction model for individual prognosis or diagnosis” can serve as a landmark for AI methods too. These guidelines have been published by the EQUATOR Network, an organization aiming to enhance the quality and transparency of health research [49, 50, 51]. Cf. [52] for a discussion of how the TRIPOD statement relates to AI.
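
As an illustration, a minimal sketch of such a mapping in Python (using scikit-learn); the toy data, the 8x8 picture size and the three disease code labels are invented for illustration and are not taken from the cited guidelines:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in for a multivariable prediction model: each "picture"
    # is an 8x8 pixel array flattened into a 64-dimensional input vector;
    # the output is one of three hypothetical disease classification codes.
    rng = np.random.default_rng(seed=0)
    X = rng.random((200, 8 * 8))                         # multidimensional inputs
    y = rng.choice(["code_A", "code_B", "code_C"], 200)  # output labels

    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)                # learn the input-to-output mapping
    print(model.predict(X[:3]))    # pictures -> disease classification codes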

ML/AI models are implemented as pieces of software and hence belong to digital technologies in almost all cases (in principle, they can be analogue hardware, too [53]). The International Medical Device Regulators Forum has outlined principles for the clinical evaluation of software as a medical device in a draft from 2017 [54]. Three main topics structure this clinical evaluation process: (a) assuring that there is a “valid clinical association” between the software output and the “targeted clinical condition”; (b) correct processing of the “input data to generate accurate, reliable, and precise output data”; (c) achieving the “intended purpose in your target population in the context of clinical care” using the software output data.
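
Read as a sequential checklist, the three topics might be encoded as follows; this is only an illustrative sketch, with hypothetical field names and boolean outcomes standing in for the actual evidence the draft requires:

    from dataclasses import dataclass

    @dataclass
    class ClinicalEvaluation:
        """Illustrative checklist for the three IMDRF evaluation topics."""
        valid_clinical_association: bool  # (a) output relates to the targeted clinical condition
        analytical_validation: bool       # (b) input data yield accurate, reliable, precise output data
        clinical_validation: bool         # (c) output achieves the intended purpose in the target population

        def complete(self) -> bool:
            # All three topics must be addressed for the evaluation to pass.
            return (self.valid_clinical_association
                    and self.analytical_validation
                    and self.clinical_validation)

    evaluation = ClinicalEvaluation(True, True, False)
    print(evaluation.complete())  # False: clinical validation is still outstanding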

The English National Institute for Health and Care Excellence (NICE) has published an “evidence standards framework for digital health technologies” in March 2019 [55]. This document “describes standards for the evidence (…) of effectiveness relevant to the intended use(s) of the technology”. Moreover, the document states that the framework is applicable to digital health technologies “that incorporate artificial intelligence using fixed algorithms”, excluding adaptive AI algorithms.

4. ML/AI PERFORMANCE EVALUATION

ML/AI models are expected to return meaningful results that are accurate, plausible and reliable when processing completely novel data points, i.e. data the model has never seen before, during actual usage in the “real world”. Out-of-sample tests make it possible to assess this capability to some degree, if the tests are conducted appropriately. These tests can largely be conducted in silico, at least as a first step, without posing the potential hazards of clinical trials: the model is confronted with previously recorded test samples, and the model output is compared with the “ground truth” for the respective task. This characteristic allows systematic tests to be conducted at large scale (e.g. using databases with thousands of MRI images) in a replicable and fast manner (e.g. in the case of software updates, or adaptive algorithms).
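
A minimal sketch of such an in silico test, with invented toy data standing in for previously recorded test samples and their ground-truth annotations:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(seed=1)

    # A model trained earlier on separate data (toy stand-in here).
    model = LogisticRegression(max_iter=1000)
    model.fit(rng.random((200, 64)), rng.choice(["code_A", "code_B"], 200))

    # Previously recorded samples with known "ground truth" annotations;
    # the model has never seen them during training.
    X_recorded = rng.random((500, 64))
    y_truth = rng.choice(["code_A", "code_B"], 500)

    # Out-of-sample test: compare the model output with the ground truth.
    y_pred = model.predict(X_recorded)
    print("out-of-sample accuracy:", accuracy_score(y_truth, y_pred))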

The machine learning community usually evaluates the performance of ML/AI models as follows. First, the model is tested out-of-sample, but in-house, by splitting the available data into a training and a test set, often in a cross-validation scheme. The trained model computes labels or other output variables from the input data of the test set, and these are statistically compared with the “true” labels or annotations (the comparison is summarized in a score).
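
The split-and-score procedure can be sketched as follows; the data are again toy stand-ins, and the classifier, the split ratio and the accuracy score are illustrative choices rather than prescribed by the text:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score, train_test_split

    rng = np.random.default_rng(seed=2)
    X = rng.random((300, 64))
    y = rng.choice(["code_A", "code_B"], 300)

    # Split the available data into a training and a test set ...
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test-set score:", accuracy_score(y_test, model.predict(X_test)))

    # ... or use a cross-validation scheme: the data are split repeatedly,
    # and each fold serves once as the held-out test set.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print("cross-validation scores:", scores)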