Page 52 - ITU Journal - ICT Discoveries - Volume 1, No. 2, December 2018 - Second special issue on Data for Good
Note that our analysis also shows the potential for obtaining subnational estimates for stocks of migrants. While countries like the United States have good migration stocks data at both the national and the subnational level, other countries, especially less developed ones, might only have reliable data at the national level and could potentially benefit from our approach for creating spatially disaggregated estimates.

While our current research focuses on improving estimates for stocks of migrants at the traditional time scale of one year, similar approaches also hold promise for monitoring short-term flows of migrants. As an example, recent data on Facebook users in Spain [25] showed a surprisingly high number of likely migrants from Venezuela, higher than the numbers in the most recent official statistics, but very plausible given the current crisis in Venezuela.

5. LIMITATIONS AND CHALLENGES

When deriving any type of insights from online data, the question of data bias naturally comes up. Here we discuss two of the most important types of bias, as well as other limitations and challenges related to using online advertising audience estimates for monitoring global development.

Arguably the most frequently cited bias is selection bias where, say, insights derived from online data are more likely to represent the behavior and the needs of the well-off than those of the people most in need, as the latter are less likely to contribute to the data in the first place. Irrespective of economic status, people not on social media do not contribute to the generation of the data and hence to the analysis. This can lead to distorted results such as the finding that in countries with fewer women than men on Google+, those women who are online have a higher online status than the men [26].

However, this type of bias and data distortion is not necessarily problematic. Firstly, it is often exactly the missing data that is the signal. For example, in our research on gender gaps it is the fact that women are not found in the data at the same rate as men that provides a signal on gender inequalities. Secondly, approaches using supervised machine learning, such as regression models, treat the (biased) data merely as a signal to predict a particular quantity of interest, e.g. stocks of migrants [27]. As long as the signal has high predictive power, it has potential value for the task. For such approaches, selection bias is only a challenge when it is non-systematic, e.g. when different countries exhibit different mechanisms underlying the selection bias and where these mechanisms cannot be understood and modeled, and hence corrected for, through available data. Nevertheless, machine-learning models, when applied to social and demographic outcomes, can produce predictions that are inconsistent with prior domain knowledge. For example, in the case of demographic rates, certain empirical regularities by age and gender are well documented: age-specific migration rates, for instance, tend to peak in early adulthood. Combining substantive and theoretical knowledge about the drivers of migration, or patterns of gender equality, with machine-learning based approaches is a challenge that is important to address. A further challenge emerges when several different estimates for the quantity to be predicted, or in other words, different ground truth measures, are available, which come from different surveys or other data sources. In this case, applying methods to combine and generate estimates that account for different types of error and uncertainty across different data sources is an important area for future research.

Another more fundamental type of bias affecting a lot of data for development research is the so-called “streetlight effect” where a particular issue is studied not because it is the most pressing issue, but because data is readily available. This type of bias certainly does extend to advertising audience estimates which, say, are unlikely to provide important insights for improving statistics on Goal #14, life below water. Our work by no means advocates stopping the collection of traditional data, or ceasing to collect data on pressing issues with poor data availability.

More fundamentally, the streetlight effect is also related to the problem that not everything that counts is countable. Though we certainly believe that better data can contribute to better solutions,⁸
8 Also see the article at https://www.theguardian.com/global-development-professionals-network/2014/dec/17/data-revolution-limitations-in-images for a discussion of general shortcomings of the data revolution vision.
© International Telecommunication Union, 2018