Page 34 - Proceedings of the 2017 ITU Kaleidoscope
2017 ITU Kaleidoscope Academic Conference
that traditional anonymization techniques do not adequately prevent the risk of re-identification of the data subject, thus leaving them vulnerable to similar threats as though they were explicitly identified. For instance, a study in the United States found that 87.1 percent of the people were uniquely identified by their combined five-digit ZIP code, birthdate and sex (Sweeney, 2010 [13]). Another study re-identified data subjects based purely on their movie preferences on Netflix (Arvind Narayanan et al, 2008 [16]). Thus, the science of what data fields might lead to re-identification when combined with other fields (and even other available databases) is an evolving one. That said, Paul Ohm offers a sobering conclusion in his research on anonymization and re-identification - “Data can be either useful or perfectly anonymous but never both” (Ohm, 2012 [5]). In doing so, the author highlights a necessary tension between the usefulness of data disclosures and privacy interests.

Accordingly, in proposing a framework for open data related to Aadhaar and its uses, we begin with the foundational principle that a person’s Aadhaar number or other PII can never constitute a part of an open dataset. Even when such data is sought to be anonymized, it is critical to assess the risks of re-identification, and propose privacy principles that minimize these risks. We do not attempt a granular analysis of the re-identification risk in the possible ways of sharing raw data from Aadhaar (although such an exercise would also be valuable). Instead, we attempt to provide a heuristic by which to understand these risks, and recommend some approaches over others. In the following section we look at two such approaches:

1. Redacting "identifying information": This is the process of redacting fields of information that are typically understood to identify individuals. In the case of, say, the telecom subscriber database, this might include name, phone number and legally mandated confidential categories like Aadhaar number. For a researcher it might well be that the existence of a unique identifier would allow far greater linkages and insights, particularly when comparing several telecom companies’ datasets. However, it is precisely this that would make individuals identifiable and vulnerable to privacy threats, including from firms that seek to utilize this data for various purposes like marketing or promotions. Techniques like adding “noise” - variations at random to the dataset - are being explored as potential solutions. The re-identification risk in any Aadhaar-linked dataset, including that of subscribers with only licensed service area, gender and age, should also be subject to such rigorous assessments.

2. Releasing aggregate statistics: Ohm points to another critical lesson - when PII is actually redacted from the dataset, with minimal risk of re-identification, then the release of the dataset on its own has little value for research. In the telecom dataset example, the primary insights would be aggregate statistics about the total number of male/female/transgender subscribers, as well as statistics relating to age and licensed service area, and a combination of the three. Therefore, the release of summary statistics, without underlying full datasets, might be a preferred option. These data points could still prove valuable for the purposes of accountability, research and policy making. Accordingly, the immediate focus of our recommendations and the framework suggested in the next section remains specifically on the release of aggregated summary statistics.

As discussed earlier, there could be various granular statistics, like authentication volumes and error rates, about the operation of the Aadhaar system that would help to evaluate the various programmes it is linked to and the operation of the system itself. Similarly, crucial demographic information is held by multiple entities, and remains unknown to both government and the public. We discussed the gender-based split-up of telecom subscribers and health care disbursements as some examples.

Another variation of aggregation could be interactive techniques. Here, the data administrator (say, in this case, UIDAI, government departments, banks, telecom companies) answers specific questions about the dataset without releasing the underlying dataset. The RTI Act allows individuals to make such queries to public authorities, but the onus here would once again fall on individuals or research groups, taking away from the principle of open data altogether. In addition, private companies are not included in its scope. Yet the interactive method might still be instructive. For example, if priority areas for open data were identified in advance, then this could act as a guide for the disclosures made subsequently. This is discussed further in the next section on implementation.

4.2. Monitoring and enforcement framework

The requirements of the Aadhaar Act are implemented by UIDAI through regulations framed by it and the terms and conditions stipulated in the agreements that it enters into with authorized authentication and eKYC agencies. Further, Section 23(2)(p) of the Aadhaar Act entitles UIDAI to “appoint such committees as may be necessary to assist the Authority in discharge of its functions for the purposes of this Act”. Drawing from these instruments, we propose the following steps to create an implementation framework that can leverage the existing provisions of the Aadhaar Act to create an open data framework that is compatible with the principles suggested above.

Step I: The preamble to the Aadhaar Act recognizes the importance of good governance and efficiency, particularly in the context of use of public resources. Recognizing the importance of transparency and accountability as critical tools of good governance, the government and UIDAI should agree on the key priority areas around which Aadhaar-related open data needs to be built. Given the nature of data collected by UIDAI, gender, age and geographic location would appear to be the logical choices.

Step II: UIDAI should formulate a new set of regulations to implement the Aadhaar open data policy, which would include the creation of a multi-stakeholder open data committee. The regulations will encode principles and processes for generating Aadhaar-related open data. This process should be accompanied by a review and amendment of existing regulations that might constrain such use. For instance, the Aadhaar authentication