Page 111 - Big data - Concept and application for telecommunications


            –       tracking back sources of errors;

            –       allowing automated re-enactment of derivations to update data;
            –       providing attribution of data sources.
            Provenance information (PI) is composed of a set of data flows, and each flow contains information about
            processes (f), data sources (d) and responsible parties (p). In this sense, PI is denoted as:
                                                     PI = {(f, p), (d, p)}
            A data flow is divided into a directly associated flow and a subordinately associated flow. For example, in
            Figure 6-1, the provenance information (PI) about Data d is composed of:
            –       directly associated flow: PI = {(f2, pC), (Data c, pC)};
            –       subordinately associated flow: PI (Data c) = {(f1, pC), ((Data a, pA), (Data b, pB))}.

                                   Figure 6-1 – An example of data provenance information
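
The notation above can be illustrated with a minimal Python sketch. It is not part of the Recommendation: the names Flow, PROVENANCE, trace_sources and lineage are hypothetical, and the encoding of Figure 6-1 is an assumption based only on the two flows listed above. Each flow stores one (f, p) pair and its (d, p) pairs; following the subordinately associated flows gives the original data sources (attribution) and the ordered processes (tracking back sources of errors).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One data flow: a (process, party) pair plus its (data source, party) pairs."""
    process: tuple[str, str]
    sources: tuple[tuple[str, str], ...]


# Assumed encoding of the example in Figure 6-1 (illustrative only).
PROVENANCE: dict[str, Flow] = {
    # Directly associated flow of Data d.
    "Data d": Flow(("f2", "pC"), (("Data c", "pC"),)),
    # Subordinately associated flow, i.e. the flow that produced Data c.
    "Data c": Flow(("f1", "pC"), (("Data a", "pA"), ("Data b", "pB"))),
}


def trace_sources(data: str) -> list[tuple[str, str]]:
    """Return the original (data source, party) pairs that `data` derives from."""
    flow = PROVENANCE.get(data)
    if flow is None:
        return []  # no recorded flow: nothing to trace
    roots: list[tuple[str, str]] = []
    for name, party in flow.sources:
        if name in PROVENANCE:
            # The source is itself derived: follow its subordinate flow.
            roots.extend(trace_sources(name))
        else:
            roots.append((name, party))
    return roots


def lineage(data: str) -> list[tuple[str, str]]:
    """Return the (process, party) pairs that produced `data`, earliest first."""
    flow = PROVENANCE.get(data)
    if flow is None:
        return []
    steps: list[tuple[str, str]] = []
    for name, _party in flow.sources:
        steps.extend(lineage(name))
    steps.append(flow.process)
    return steps


if __name__ == "__main__":
    print(trace_sources("Data d"))
    print(lineage("Data d"))
```

In this sketch, tracing Data d yields Data a (party pA) and Data b (party pB) as the original sources, and the lineage lists f1 before f2. A dictionary keyed by data name is only one possible representation; a real provenance store would also need to handle cyclic or shared derivations.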

            6.2     Data provenance in big data ecosystem
            In a big data environment, complex data processing and migration, driven by the big data lifecycle operations
            (e.g., data generation, transmission, storage, use, and deletion) and by data distribution, cause various
            difficulties in managing data provenance. According to the big data ecosystem described in [ITU-T
            Y.3600], big data provenance needs to handle:

            –       huge volumes of non-structured, semi-structured and structured data;
            –       descriptions of functions for various types and formats of data;
            –       data traceability across multiple application domains.
                    NOTE 1 – Application domain is an area of knowledge or activity applied for one specific economic,
                    commercial, social or administrative scope [b-ITU-T Y.4100].
                    NOTE 2 – Transport application domain, health application domain and government application
                    domain are examples of application domains.
            In addition, the big data computing environment poses several challenges for data provenance, such as:
            –       efficient storage mechanism for provenance data: The size of provenance data can be larger than
                    that of the original data, causing storage overhead;
            –       minimizing provenance collection overhead: In a distributed system environment, it is important
                    to consider the cost of recording provenance together with the computation cost;

                                                   Static data – Data provenance, data formats and trust   103