Page 52 - ITU Journal - ICT Discoveries - Volume 1, No. 2, December 2018 - Second special issue on Data for Good




Note that our analysis also shows the potential for obtaining subnational estimates for stocks of migrants. While countries like the United States have good migration stocks data at both the national and the subnational level, other countries, especially less developed ones, might only have reliable data at the national level and could potentially benefit from our approach for creating spatially disaggregated estimates.

While our current research focuses on improving estimates for stocks of migrants at the traditional time scale of one year, similar approaches also hold promise for monitoring short-term flows of migrants. As an example, recent data on Facebook users in Spain [25] showed a surprisingly high number of likely migrants from Venezuela, higher than the numbers in the most recent official statistics, but very plausible given the current crisis in Venezuela.

5. LIMITATIONS AND CHALLENGES

When deriving any type of insights from online data, the question of data bias naturally comes up. Here we discuss two of the most important types of bias, as well as other limitations and challenges related to using online advertising audience estimates for monitoring global development.

Arguably the most frequently cited bias is selection bias where, say, insights derived from online data are more likely to represent the behavior and the needs of the well-off than those of the people most in need, as the latter are less likely to contribute to the data in the first place. Irrespective of economic status, people not on social media do not contribute to the generation of the data and hence to the analysis. This can lead to distorted results, such as the finding that in countries with fewer women than men on Google+, those women who are online have a higher online status than the men [26].

However, this type of bias and data distortion is not necessarily problematic. Firstly, it is often exactly the missing data that is the signal. For example, in our research on gender gaps it is the fact that women are not found in the data at the same rate as men that provides a signal on gender inequalities.

Secondly, approaches using supervised machine learning, such as regression models, treat the (biased) data merely as a signal to predict a particular quantity of interest, e.g. stocks of migrants [27]. As long as the signal has high predictive power, it has potential value for the task. For such approaches, selection bias is only a challenge when it is non-systematic, e.g. when different countries exhibit different mechanisms underlying the selection bias and these mechanisms cannot be understood and modeled, and hence corrected for, through available data. Nevertheless, machine-learning models applied to social and demographic outcomes can produce predictions that are inconsistent with prior domain knowledge. For example, in the case of demographic rates, certain empirical regularities by age and gender are well documented: age-specific migration rates tend to peak in early adulthood. Combining substantive and theoretical knowledge about the drivers of migration, or patterns of gender equality, with machine-learning based approaches is a challenge that is important to address. A further challenge emerges when several different estimates for the quantity to be predicted, or in other words different ground truth measures, are available from different surveys or other data sources. In this case, applying methods that combine these sources and generate estimates accounting for the different types of error and uncertainty across them is an important area for future research.

Another, more fundamental type of bias affecting a lot of data for development research is the so-called “streetlight effect”, where a particular issue is studied not because it is the most pressing issue, but because data is readily available. This type of bias certainly extends to advertising audience estimates which, say, are unlikely to provide important insights for improving statistics on Goal #14, life below water. Our work by no means advocates stopping the collection of traditional data, or of data on pressing issues with poor data availability.

More fundamentally, the streetlight effect is also related to the problem that not everything that counts is countable.⁸ Though we certainly believe that better data can contribute to better solutions,

⁸ Also see the article at https://www.theguardian.com/global-development-professionals-network/2014/dec/17/data-revolution-limitations-in-images for a discussion of general shortcomings of the data revolution vision.
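
The argument in Section 5 that a systematic selection bias need not prevent prediction can be illustrated with a minimal sketch. The Python example below fits a log-log regression on synthetic data in which the online audience is a fixed fraction of the true migrant stock. All numbers, the 40% coverage rate, and the helpers `fit_line` and `predict_stock` are assumptions made purely for illustration; they are not taken from the data analyzed in this paper or from the models in [27].

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic "ground truth" stocks for four hypothetical regions (made-up values).
true_stocks = [10_000, 50_000, 200_000, 1_000_000]

# Assumption: the platform systematically captures 40% of migrants everywhere.
COVERAGE = 0.4
audience = [COVERAGE * s for s in true_stocks]

# Regress log(true stock) on log(audience estimate).
slope, intercept = fit_line(
    [math.log(a) for a in audience],
    [math.log(s) for s in true_stocks],
)


def predict_stock(audience_estimate):
    """Map a biased audience estimate to a predicted stock using the fitted line."""
    return math.exp(intercept + slope * math.log(audience_estimate))


# A new hypothetical region with the same coverage and a true stock of 500,000.
predicted = predict_stock(COVERAGE * 500_000)
```

Because the bias in this toy setting is identical everywhere, the fitted slope is close to 1 and the intercept absorbs the coverage rate, so the prediction recovers the true stock. If coverage varied across regions in ways that the available data could not explain, the same fit would leave region-specific errors, which corresponds to the non-systematic case discussed in Section 5.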




© International Telecommunication Union, 2018