references (e.g. annotated images) to enable the developers to carry out a trial run of their code.
For a clean and fair evaluation, a trusted third party should, as an independent arbiter, receive the trained model and conduct the tests on data that have never been published before. This cautious procedure prevents unfair conduct, e.g. tuning the model for optimal performance on this particular test set (“overfitting”) without it actually generalizing well to the real-world data that can be expected in practice. Therefore, widely available public data sets cannot be used for the evaluation, and the entire test data set must remain secret, i.e. neither labeled nor unlabeled test data should be made available. The model performance should be evaluated in a closed computing environment without Internet access. Otherwise, test data could be leaked against the rules and the model tweaked on the test data. In addition, leaderboard probing and other potential pitfalls known from ML challenges must be kept in mind [72, 73]. The trusted third party is responsible for protecting both the test data and the ML/AI model: the test data must remain secret to allow meaningful subsequent testing, and the AI models may contain business-relevant trade secrets of the developer.
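To make this procedure concrete, the following minimal Python sketch outlines the arbiter-side sequence under assumed interfaces; the predict.py entry point, the file locations, and the connectivity check are illustrative assumptions, not tooling prescribed by the focus group.

import socket
import subprocess
from pathlib import Path

# Hypothetical, pre-agreed locations on the sequestered machine.
TEST_DATA = Path("/secure/test_data")          # undisclosed test set
PREDICTIONS = Path("/secure/predictions.csv")  # model output for scoring

def assert_offline() -> None:
    """Fail fast if the machine still has Internet access: test data
    may only be present while the environment is closed."""
    try:
        socket.create_connection(("8.8.8.8", 53), timeout=2).close()
    except OSError:
        return  # no route out, so the environment is closed as required
    raise RuntimeError("Machine is online; disconnect before loading test data.")

def run_evaluation() -> None:
    assert_offline()
    # Execute the developer's previously installed prediction routine
    # on the sequestered test data (the entry point is an assumption).
    subprocess.run(
        ["python", "predict.py",
         "--input", str(TEST_DATA), "--output", str(PREDICTIONS)],
        check=True,
    )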
In this spirit, focus group members have conducted a first proof-of-concept benchmark for digital pathology, where an ML/AI model can provide diagnostic support by quantifying tumor-infiltrating lymphocytes in breast cancer from whole-slide histopathology images, which is relevant for prognosis and therapy selection [cf. 74, 75]. The topic group had defined the evaluation task and procedure, and had acquired and annotated test data. The developer had trained a model on its own training data to predict the annotations that a pathologist would give from the images. A focus group member acting as arbiter provided the computing infrastructure according to the specifications of the developer (here, a desktop computer with a certain graphics processing unit, operating system, package manager, and ML framework installed) and granted the developer access via the Internet to install the prediction routine. A few annotated example data sets enabled the developer to test the prediction routine. After disconnecting the computer from the Internet, the arbiter uploaded the undisclosed test data, received directly from the topic group, onto the machine and executed the prediction routine, which processed the data and predicted the annotations. Finally, scores (true positive rate and true negative rate) were computed by comparison with the reference annotations and reported back to the topic group and the developer. Naturally, this manual procedure can be automated and scaled, e.g. with one of the ML challenge frameworks mentioned in section 4, potentially installed on a server on ITU or UN premises.
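As an illustration of the final scoring step, the two reported scores can be computed with a few lines of Python; the function name and the binary label encoding below are assumptions, since the benchmark's actual data format is not reproduced here.

def tpr_tnr(y_true, y_pred):
    """True positive rate (sensitivity) and true negative rate
    (specificity) for binary labels, with 1 = positive, 0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Example: reference annotations vs. model predictions
reference = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
tpr, tnr = tpr_tnr(reference, predicted)  # both 2/3 here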
Interaction with further health institutions will potentially be strengthened, e.g. with the International Association of National Public Health Institutes, the InterAcademy Partnership, and the World Health Summit. Further information about the scope and general process of the focus group can be found in a commentary in The Lancet [29] and a white paper on the website, where the full documentation of all previous meetings is also published.

6. OUTLOOK

In summary, the ITU/WHO focus group on “AI for Health” has taken the first exploratory steps towards international health ML/AI evaluation standards. For the future, we expect that a wide spectrum of health ML/AI topics will be addressed and that insights from the evaluation will be brought back to research and development. The evaluation procedure will be continuously refined in a repeated cycle, considering further quality criteria beyond mere performance and including high-quality test data with increasing geographic coverage. For the years to come, we also anticipate a further deepening of cooperation on ML/AI between standard-setting organizations. While the standardization activities on ML/AI differ in their thematic scope and particular objective (see section 2), they can profit from collaboration, because different application areas of ML/AI often share problems and data modalities. For instance, assuring robust automatic image interpretation can be relevant for a range of safety-critical application domains and is not limited to healthcare. At the same time, a generic approach is often not possible, because the cross-sectional ML/AI technologies require cooperation with the respective domain experts. A good example of this multidisciplinary cooperation is the joint focus group of ITU and WHO, which brings together expertise from information technology and health standardization bodies. In particular, this initiative shows that global collaboration can leverage synergy effects, since many relevant issues are common across the world.

REFERENCES

[1] World Health Organization (2019). Global Strategy on Digital Health 2020-2024. Retrieved from https://www.who.int/DHStrategy

[2] U.S. Food and Drug Administration (2018). FDA News Release. Retrieved from https://www.fda.gov/news-events/press-announcements/fda-permits-marketing-artificial-intelligence-based-device-detect-certain-diabetes-related-eye

[3] Mesko, B. (2019). FDA Approvals For Smart Algorithms In Medicine In One Giant Infographic. The Medical Futurist. Retrieved from https://medicalfuturist.com/fda-approvals-for-algorithms-in-medicine

[4] U.S. Food and Drug Administration (2019). Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) - Discussion Paper and Request for Feedback. Retrieved from https://www.regulations.gov/document?D=FDA-2019-N-1185-0001