Page 34 - Proceedings of the 2017 ITU Kaleidoscope

2017 ITU Kaleidoscope Academic Conference




that traditional anonymization techniques do not adequately prevent the risk of re-identification of data subjects, thus leaving them vulnerable to similar threats as though they were explicitly identified. For instance, a study in the United States found that 87.1 percent of people were uniquely identified by their combined five-digit ZIP code, birthdate and sex (Sweeney, 2010 [13]). Another study re-identified data subjects based purely on their movie preferences on Netflix (Arvind Narayanan et al., 2008 [16]). Thus, the science of which data fields might lead to re-identification when combined with other fields (and even other available databases) is an evolving one. That said, Paul Ohm offers a sobering conclusion in his research on anonymization and re-identification: “Data can be either useful or perfectly anonymous but never both” (Ohm, 2012 [5]). In doing so, the author highlights a necessary tension between the usefulness of data disclosures and privacy interests.

Accordingly, in proposing a framework for open data related to Aadhaar and its uses, we begin with the foundational principle that a person’s Aadhaar number or other PII can never constitute a part of an open dataset. Even when such data is sought to be anonymized, it is critical to assess the risks of re-identification, and to propose privacy principles that minimize these risks. We do not attempt a granular analysis of the re-identification risk in the possible sharing of raw data from Aadhaar (although such an exercise would also be valuable). Instead, we attempt to provide a heuristic by which to understand these risks, and recommend some approaches over others. In the following section we look at two such approaches:

1. Redacting "identifying information": This is the process of redacting fields of information that are typically understood to identify individuals. In the case of, say, the telecom subscriber database, this might include name, phone number and legally mandated confidential categories like Aadhaar number. For a researcher, it might well be that the existence of a unique identifier would allow far greater linkages and insights, particularly when comparing several telecom companies’ datasets. However, it is precisely this that would make individuals identifiable and vulnerable to privacy threats, including from firms that seek to utilize this data for purposes like marketing or promotions. Techniques like adding “noise” - random variations to the dataset - are being explored as potential solutions. The re-identification risk in any Aadhaar linked dataset, including a dataset of subscribers with only licensed service area, gender and age, should also be subject to such rigorous assessments.

2. Releasing aggregate statistics: Ohm points to another critical lesson - when PII is actually redacted from the dataset, with minimal risk of re-identification, then the release of the dataset on its own has little value for research. In the telecom dataset example, the primary insights would be aggregate statistics about the total number of male/female/transgender subscribers, as well as statistics relating to age and licensed service area, and a combination of the three. Therefore, the release of summary statistics, without the underlying full datasets, might be a preferred option. These data points could still prove highly valuable for the purposes of accountability, research and policy making. Accordingly, the immediate focus of our recommendations and the framework suggested in the next section remains specifically on the release of aggregated summary statistics. As discussed earlier, there could be various granular statistics, like authentication volumes and error rates, about the operation of the Aadhaar system that would help to evaluate the various programmes it is linked to and the operation of the system itself. Similarly, crucial information about demography is held by multiple entities, and remains unknown to both government and the public. We discussed the gender-based split of telecom subscribers and health care disbursements as some examples.

Another variation of aggregation could be interactive techniques. Here, the data administrator (say, in this case, UIDAI, government departments, banks, telecom companies) answers specific questions about the dataset without releasing the underlying dataset. The RTI Act allows individuals to make such queries to public authorities, but the onus here would once again fall on individuals or research groups, taking away from the principle of open data altogether. In addition, private companies are not included in its scope. Yet the interactive method might still be instructive. For example, if priority areas for open data were identified in advance, then this could act as a guide for the disclosures made subsequently. This is discussed further in the next section on implementation.

4.2. Monitoring and enforcement framework

The requirements of the Aadhaar Act are implemented by UIDAI through regulations framed by it and the terms and conditions stipulated in the agreements that it enters into with authorized authentication and eKYC agencies. Further, Section 23(2)(p) of the Aadhaar Act entitles UIDAI to “appoint such committees as may be necessary to assist the Authority in discharge of its functions for the purposes of this Act”. Drawing from these instruments, we propose the following steps to create an implementation framework that can leverage the existing provisions of the Aadhaar Act to create an open data framework that is compatible with the principles suggested above.

Step I: The preamble to the Aadhaar Act recognizes the importance of good governance and efficiency, particularly in the context of the use of public resources. Recognizing the importance of transparency and accountability as critical tools of good governance, the government and UIDAI should agree on the key priority areas around which Aadhaar related open data needs to be built. Given the nature of data collected by UIDAI, gender, age and geographic location would appear to be the logical choices.

Step II: UIDAI should formulate a new set of regulations to implement the Aadhaar open data policy, which would include the creation of a multi-stakeholder open data committee. The regulations will encode principles and processes for generating Aadhaar related open data. This process should be accompanied by a review and amendment of existing regulations that might constrain such use. For instance, the Aadhaar authentication


