and in-house test results are reported in a scientific paper that is reviewed by peers for publication in a journal or conference proceedings. Occasional open-source releases of the software code can, in principle, allow the reviewers and other peers to reproduce the results. Yet, in many cases the model performance is evaluated only in-house, e.g. because the code/model is not published or because of legal or other barriers to sharing the test data. Therefore, it remains unclear whether the evaluation was conducted properly, whether common pitfalls were avoided [cf. 56], such as leakage between test and training data, or whether the test data set was (un)intentionally curated, all of which can result in overestimating the model performance and in spurious results. Performance reports of different models can often not be compared because of individual data preprocessing and filtering. This problem is even more severe for commercial AI developers, which typically refrain from publishing details of their methods or the code [57].
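
To make the leakage pitfall concrete, the following minimal sketch (assuming scikit-learn; the data, the feature-selection step and the classifier are illustrative placeholders, not taken from any of the cited studies) contrasts feature selection fitted on the complete data set before splitting with a pipeline that keeps the test set untouched:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))          # pure noise features (placeholder data)
y = rng.integers(0, 2, size=200)          # labels unrelated to X

# Pitfall: select "informative" features on the *whole* data set, then split.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_sel, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# Correct: keep the test set untouched; fit feature selection on training data only.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean = pipe.fit(Xtr, ytr).score(Xte, yte)

print(f"leaky accuracy: {leaky:.2f}  vs.  clean accuracy: {clean:.2f}")
# With noise-only features the "leaky" estimate is typically well above chance,
# while the clean estimate stays near 0.5 - the overestimation described above.
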
For a range of tasks, human experts are required to label or annotate the test data. In fact, experts can disagree, which leads to questions related to the so-called “ground truth” or “gold standard”: how many experts of which level of expertise [57] need to be asked?
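
As a rough illustration of why the number and reliability of annotators matter, the sketch below (a simplified simulation assuming scikit-learn; the three experts and their error rates are hypothetical, not from the paper) quantifies pairwise chance-corrected agreement and derives a simple majority-vote reference label:

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(3)
truth = rng.integers(0, 2, size=300)          # unobservable true condition
error = (0.05, 0.10, 0.20)                    # assumed per-expert error rates
labels = np.stack([np.where(rng.random(300) < e, 1 - truth, truth) for e in error])

# Chance-corrected pairwise agreement (Cohen's kappa).
for i in range(3):
    for j in range(i + 1, 3):
        k = cohen_kappa_score(labels[i], labels[j])
        print(f"kappa(expert {i + 1}, expert {j + 1}) = {k:.2f}")

# One common pragmatic reference standard: the majority vote of several experts.
majority_vote = (labels.sum(axis=0) >= 2).astype(int)
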
Crucially, in-house test data are often very similar to the training data, e.g. when originating from the same measurement device, due to practical reasons (cost, time, access and legal hurdles). Therefore, the capacity of the AI to generalize to potentially different, previously unseen data, e.g. data from other laboratories, hospitals, regions or countries, is often unclear [cf. 58].
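
One way to probe this generalization gap during evaluation is to split by site rather than by sample, so that every test case comes from an institution the model has never seen. The following sketch (assuming scikit-learn; the simulated multi-site data and the site confound are purely illustrative) compares a conventional random split with a leave-one-site-out split:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
site = rng.integers(0, 6, size=600)            # which hospital each case came from
X = rng.normal(size=(600, 30))
X[:, :3] += site[:, None]                      # device/site "fingerprint" in the features
y = (site % 2).astype(int)                     # outcome prevalence confounded with site

clf = RandomForestClassifier(random_state=0)

# Random split: the same sites occur in training and test folds,
# so the model can exploit the site fingerprint (optimistic estimate).
acc_random = cross_val_score(clf, X, y, cv=5).mean()

# Grouped split: each test fold contains only sites unseen during training,
# closer to deployment in a new hospital, region or country.
acc_unseen = cross_val_score(clf, X, y, groups=site, cv=LeaveOneGroupOut()).mean()

print(f"random-split accuracy: {acc_random:.2f}")
print(f"unseen-site accuracy:  {acc_unseen:.2f}")
# On such confounded data the random split typically looks far better
# than the unseen-site estimate.
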
Researchers from the medical and machine learning communities are aware of these open questions and problems. The medical journal “The Lancet Digital Health” sets a good example and requires “independent validation for all AI studies that screen, treat, or diagnose disease” [59]. Machine learning scientists push for reproducibility and replicability by organizing challenges (also known as competitions), in which an independent, neutral arbiter evaluates the AI on a separate test data set [e.g. 60]. These challenges are conducted at scientific conferences (e.g. NeurIPS, MICCAI, CVPR, SPIE) and on Internet platforms (e.g. Kaggle, AIcrowd, EvalAI, DREAM Challenges, Grand Challenge). Challenge design is not trivial, and research shows that many design decisions can have a large impact on the benchmarking outcome [61]. Aspects beyond mere performance have not been addressed sufficiently so far, including the benchmarking of robustness [62] and of uncertainty [63], which are important for practical application in healthcare. Moreover, further in-depth discussions with domain experts, e.g. physicians, are required in order to find out whether the evaluation metrics used are actually relevant and reflect meaningful (clinical) endpoints [64].
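
A small but useful step beyond single-number leaderboards is to report benchmark metrics together with an uncertainty estimate. The sketch below (assuming scikit-learn; the test labels and predictions are simulated placeholders, not results from any cited challenge) attaches a percentile-bootstrap confidence interval to the sensitivity measured on a fixed test set:

import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)                           # reference labels
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)   # ~85% correct predictions

def bootstrap_ci(metric, y_true, y_pred, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a metric on a fixed test set."""
    n = len(y_true)
    scores = [metric(y_true[idx], y_pred[idx])
              for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), lo, hi

sens, lo, hi = bootstrap_ci(recall_score, y_true, y_pred)
print(f"sensitivity = {sens:.2f} (95% CI {lo:.2f} to {hi:.2f})")
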

5.  ITU/WHO FOCUS GROUP ON AI FOR HEALTH

While there is considerable experience and previous work to build upon, generally accepted, impartial standards for health ML/AI evaluation are still missing. Standardization bodies have merely started to address health ML/AI technologies (cf. section 2). Principles for prediction models, software and digital health technologies can provide some overall orientation (section 3), but they can only serve as a starting point and need to be transferred to the characteristics of the novel technologies. State-of-the-art procedures for ML/AI performance evaluation are a sound foundation, but the limits discussed in section 4 need to be addressed.

The mission of the ITU/WHO focus group on “AI for Health” (FG-AI4H) is to undertake crucial steps towards evaluation standards that are applicable on a global scale, an approach that offers substantial potential for synergies. A large number of national regulatory institutions, public health institutes, physicians, patients, developers, health insurances, licensees, hospitals and other decision-makers around the globe can profit from a common, standardized benchmarking framework for health ML/AI. Standards thrive when they are sustained by a broad community. Therefore, the focus group is creating an ecosystem of diverse stakeholders from industry, academia, regulation, and policy with a common, substantial interest in health ML/AI benchmarking. ITU and WHO officials monitor and document the overall process. Since its foundation in July 2018, the focus group has been organizing a series of free workshops with subsequent multi-day meetings in Europe, North America, Asia, Africa and India (and South America in January 2020) every two or three months to engage the regional communities. Participation in the focus group is encouraged by attending the events on site or remotely via the Internet. In addition, further virtual collaboration allows work to be carried forward in between meetings. These online participation possibilities and the generous support from a charitable foundation, with travel grants for priority regions, foster global participation, considering time and resource constraints.

The structure of FG-AI4H is shown in Figure 1. Two types of sub-groups generate the main deliverables: working groups (WGs) and topic groups (TGs).

Figure 1 – Structure of FG-AI4H

WGs consider matters such as data and AI solution handling, assessment methods, health requirements, operations, and regulatory considerations. Many of these matters are cross-cutting subjects that affect a specific aspect of an AI for



