Page 113 - Big data - Concept and application for telecommunications


            Figure 7-1 shows the use of data provenance in a big data ecosystem:

            –       when data is imported from an outside data source (data provider (DP):data supplier (DS)) and
                    stored, BDSP (big data system A) generates metadata based on the importing context (e.g.,
                    responsible party information, time, size), and these metadata entities are used as provenance
                    information;
            –       BDSP A monitors and stores the process of data mash-up or analysis as a form of provenance
                    information to ensure the reliability of data quality and the reproducibility of analysis results;

            –       when BDSP A exports data to BDSP B or registers a data catalogue with a data registry in a data
                    market (DP:data broker (DB)), BDSP A delivers provenance information.
                    NOTE – When a BDSP exports or registers data with its provenance information, the BDSP manages
                    the level of detail by simplifying the provenance information according to its own data or service
                    policy.
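
The import and export steps above can be illustrated with a minimal Python sketch. It is not part of the source model: the names ImportRecord, capture_import and simplify_for_export, the field names, and the idea of expressing an export policy as a set of allowed field names are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ImportRecord:
    """Hypothetical provenance metadata captured when data is imported."""
    source: str             # outside data source, e.g. a data supplier
    responsible_party: str  # party accountable for the imported data
    imported_at: str        # ISO 8601 timestamp of the import
    size_bytes: int         # size of the imported payload


def capture_import(source: str, responsible_party: str, payload: bytes) -> ImportRecord:
    """Build provenance metadata from the importing context."""
    return ImportRecord(
        source=source,
        responsible_party=responsible_party,
        imported_at=datetime.now(timezone.utc).isoformat(),
        size_bytes=len(payload),
    )


def simplify_for_export(record: ImportRecord, allowed_fields: set) -> dict:
    """Keep only the fields that the exporter's (assumed) policy allows."""
    return {k: v for k, v in asdict(record).items() if k in allowed_fields}


# Example: import a small payload, then export a reduced view of its provenance.
record = capture_import("supplier-x", "ops-team", b"a,b,c\n1,2,3\n")
exported = simplify_for_export(record, {"source", "imported_at"})
```

Here the exporting side decides which fields leave the system, which mirrors the NOTE: the level of detail is controlled by the exporter's own policy, not by the receiver.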

            7.2     Conceptual model of big data provenance information

            Big data provenance information is an extension of the general data provenance concept, which is described
            in clause 6.1.





















                              Figure 7-2 – Conceptual model for big data provenance information
            Figure 7-2 shows a high-level conceptual model for big data provenance information. Big data provenance
            information (BD_ProvenanceInformation) is an aggregated set of big data provenance units
            (BD_ProvenanceUnit) which records a history of the most recent changes to the data.
            A big data provenance unit is a minimum set of big data provenance information. It provides information about
            data ownership or authority (ResponsibleParty), the data processing environment (ComputationalEnvironment),
            and a sequence of functions (PI_Function) with the input and output data (BD_DataInstance) that are involved
            in data mash-up or analysis.

            NOTE 1 – A workflow depicts the actual sequence of functions that describes the data processing. In the big
            data provenance information model, a workflow can be derived from the association +processStep, which
            describes the sequence of PI_Function instances.
            NOTE 2 – BD_DataInstance is metadata composed of identifiable information (e.g., access information,
            type and format of data, data size, date, personally identifiable information (PII)) for a data instance.
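
The conceptual model described above can be sketched as a handful of Python dataclasses. This is only an illustrative reading of the model: the class names follow the text, but every attribute name, the use of plain strings for ResponsibleParty and ComputationalEnvironment, and the workflow() helper are assumptions, and the sample data does not reproduce any figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BD_DataInstance:
    """Metadata identifying one data instance (subset of the items in NOTE 2)."""
    access_info: str
    data_format: str
    size_bytes: int


@dataclass
class PI_Function:
    """One processing function together with its input and output data."""
    name: str
    inputs: List[BD_DataInstance]
    outputs: List[BD_DataInstance]


@dataclass
class BD_ProvenanceUnit:
    """Minimum set of provenance information for one mash-up or analysis."""
    responsible_party: str          # stands in for ResponsibleParty
    computational_environment: str  # stands in for ComputationalEnvironment
    # Ordered list standing in for the +processStep association.
    process_steps: List[PI_Function] = field(default_factory=list)

    def workflow(self) -> List[str]:
        """Derive the workflow as the ordered names of the process steps."""
        return [step.name for step in self.process_steps]


@dataclass
class BD_ProvenanceInformation:
    """Aggregated set of provenance units."""
    units: List[BD_ProvenanceUnit] = field(default_factory=list)


# Hypothetical example: raw data is cleaned, then aggregated into a report.
raw = BD_DataInstance("store://raw", "csv", 2048)
cleaned = BD_DataInstance("store://cleaned", "csv", 1536)
report = BD_DataInstance("store://report", "json", 256)

unit = BD_ProvenanceUnit(
    responsible_party="analytics-team",
    computational_environment="cluster-a",
    process_steps=[
        PI_Function("clean", [raw], [cleaned]),
        PI_Function("aggregate", [cleaned], [report]),
    ],
)
info = BD_ProvenanceInformation(units=[unit])
```

Keeping the process steps in an ordered list is one simple way to make the workflow of NOTE 1 recoverable: unit.workflow() returns the function names in execution order.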
            Figure 7-3 illustrates an example of capturing the provenance unit on Data C.













                                                   Static data – Data provenance, data formats and trust   105