and in-house test results are reported in a scientific paper that is reviewed by peers for publication in a journal or conference proceeding. Occasional open-source releases of the software code can, in principle, allow the reviewers and other peers to reproduce the results. Yet, in many cases the model performance is evaluated only in-house, e.g. because the code/model is not published or because of legal or other barriers to sharing the test data. Therefore, it remains unclear whether the evaluation was conducted properly and whether common pitfalls were avoided [cf. 56], such as leakage between test and training data or an (un)intentionally curated test data set, all of which can result in overestimating the model performance and in spurious results. Performance reports of different models often cannot be compared because of individual data preprocessing and filtering. This problem is even more severe for commercial AI developers, who typically refrain from publishing details of their methods or the code [57].
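To make the leakage pitfall concrete, the following Python sketch checks a candidate split for exact duplicates shared between training and test records before any performance figure is reported. The file names, column layout and hashing rule are illustrative assumptions rather than part of any referenced study, and near-duplicates (e.g. repeated scans of the same patient) would still require domain-specific checks.

```python
# Minimal sketch (not from the cited studies): flag exact duplicates shared by
# the training and test data before any performance number is reported.
# File names and the "label" column are illustrative assumptions.
import hashlib

import pandas as pd


def row_fingerprints(df: pd.DataFrame) -> set:
    """Hash each row's feature values into a short, comparable fingerprint."""
    return {
        hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()
        for row in df.itertuples(index=False)
    }


train = pd.read_csv("train.csv")  # assumed: feature columns plus a "label" column
test = pd.read_csv("test.csv")

feature_cols = [c for c in train.columns if c != "label"]
overlap = row_fingerprints(train[feature_cols]) & row_fingerprints(test[feature_cols])

if overlap:
    print(f"Warning: {len(overlap)} test records also occur in the training set; "
          "metrics computed on this split will be optimistically biased.")
else:
    print("No exact duplicates between training and test sets.")
```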
For a range of tasks, human experts are required to label or annotate the test data. In fact, experts can disagree, which leads to questions related to the so-called "ground truth" or "gold standard": how many experts, and of which level of expertise [57], need to be asked? Crucially, in-house test data are often very similar to the training data, e.g. when originating from the same measurement device, due to practical reasons (cost, time, access and legal hurdles). Therefore, the capacity of the AI to generalize to potentially different, previously unseen data, e.g. data from other laboratories, hospitals, regions or countries, is often unclear [cf. 58].
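One way to make such disagreement measurable is to compute an inter-rater agreement statistic over the reference labels, as in the minimal Python sketch below; the reader labels are invented for illustration, and Cohen's kappa is only one common choice of statistic.

```python
# Minimal sketch: quantify disagreement between two expert readers on the
# reference labels of a test set. The labels below are invented for
# illustration; real studies may need more readers and an adjudication step.
from sklearn.metrics import cohen_kappa_score

# Hypothetical case-level labels from two readers: 1 = disease, 0 = no disease
reader_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
reader_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

kappa = cohen_kappa_score(reader_a, reader_b)
print(f"Cohen's kappa between the two readers: {kappa:.2f}")
# A low kappa signals that the 'ground truth' itself is uncertain, which caps
# how precisely any model can be evaluated against it.
```

The second concern, generalization, is typically probed by holding out entire sites or devices rather than individual records, for example with a grouped split (such as scikit-learn's GroupShuffleSplit keyed by hospital), so that the test data really are previously unseen.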
Researchers from the medical and machine learning communities are aware of these open questions and problems. The medical journal "The Lancet Digital Health" sets a good example and requires "independent validation for all AI studies that screen, treat, or diagnose disease" [59]. Machine learning scientists push for reproducibility and replicability by organizing challenges (also known as competitions), in which an independent, neutral arbiter evaluates the AI on a separate test data set [e.g. 60]. These challenges are conducted at scientific conferences (e.g. NeurIPS, MICCAI, CVPR, SPIE) and on Internet platforms (e.g. Kaggle, AIcrowd, EvalAI, DREAM Challenges, Grand Challenge). Challenge design is not trivial, and research shows that many design decisions can have a large impact on the benchmarking outcome [61]. Aspects beyond mere performance have not been addressed sufficiently so far, including the benchmarking of robustness [62] and of uncertainty [63], which is important for practical application in healthcare. Moreover, further in-depth discussions with domain experts, e.g. physicians, are required in order to find out whether the evaluation metrics used are actually relevant and tied to meaningful (clinical) endpoints [64].
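As one small illustration of looking beyond a single headline number, the sketch below reports a discrimination metric together with a bootstrap confidence interval and a calibration-sensitive score. The labels and predicted probabilities are synthetic placeholders, and this is only a rough proxy for the fuller robustness and uncertainty benchmarking discussed in [62, 63], not a procedure prescribed by the focus group.

```python
# Minimal sketch: complement a single discrimination score (AUC) with a
# bootstrap confidence interval and a calibration-sensitive metric (Brier
# score) instead of reporting one point estimate. Labels and scores are
# synthetic placeholders, not data from any real challenge.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(seed=0)
y_true = rng.integers(0, 2, size=200)                                   # hypothetical test labels
y_prob = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, size=200), 0, 1)  # hypothetical predicted probabilities

aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample test cases with replacement
    if len(np.unique(y_true[idx])) < 2:                   # AUC needs both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))

low, high = np.percentile(aucs, [2.5, 97.5])
print(f"AUC   = {roc_auc_score(y_true, y_prob):.3f} (95% bootstrap CI: {low:.3f}-{high:.3f})")
print(f"Brier = {brier_score_loss(y_true, y_prob):.3f} (lower is better; penalizes overconfident errors)")
```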
5. ITU/WHO FOCUS GROUP ON AI FOR HEALTH

While there is considerable experience and previous work to build upon, generally accepted, impartial standards for health ML/AI evaluation are still missing. Standardization bodies have merely started to address health ML/AI technologies (cf. section 2). Principles for prediction models, software and digital health technologies can provide some overall orientation (section 3), but can only serve as a starting point and need to be transferred to the characteristics of the novel technologies. State-of-the-art procedures for ML/AI performance evaluation are a sound foundation, but the limits discussed in section 4 need to be addressed.

The mission of the ITU/WHO focus group on "AI for Health" is to undertake crucial steps towards evaluation standards that are applicable on a global scale, an approach that offers substantial potential for synergies. A large number of national regulatory institutions, public health institutes, physicians, patients, developers, health insurances, licensees, hospitals and other decision-makers around the globe can profit from a common, standardized benchmarking framework for health ML/AI. Standards thrive on being sustained by a broad community. Therefore, the focus group is creating an ecosystem of diverse stakeholders from industry, academia, regulation, and policy with a common, substantial interest in health ML/AI benchmarking. ITU and WHO officials monitor and document the overall process. Since its foundation in July 2018, the focus group has been organizing a series of free workshops with subsequent multi-day meetings in Europe, North America, Asia, Africa and India (and South America in January 2020) every two or three months in order to engage the regional communities. Participation in the focus group is encouraged by attending the events on site or remotely via the Internet. In addition, further virtual collaboration allows work to be carried forward between meetings. These online participation possibilities and the generous support from a charitable foundation, with travel grants for priority regions, foster global participation in view of time and resource constraints.

The structure of FG-AI4H is shown in Figure 1. Two types of sub-groups generate the main deliverables: working groups (WGs) and topic groups (TGs).

Figure 1 – Structure of FG-AI4H

WGs consider matters such as data and AI solution handling, assessment methods, health requirements, operations, and regulatory considerations. Many of these matters are cross-cutting subjects that affect a specific aspect of an AI for health