Page 111 - Big data - Concept and application for telecommunications
– tracking back sources of errors;
– allowing automated re-enactment of derivations to update data;
– providing attribution of data sources.
Provenance information (PI) is composed of a set of data flows, and each flow contains information about
processes (f), data sources (d) and responsible parties (p). In this sense, PI is denoted as:
PI = {(f, p), (d, p)}
A data flow is divided into a directly associated flow and a subordinately associated flow. For example, in
Figure 6-1, the provenance information (PI) about Data d is composed of:
– directly associated flow: PI = {(f2, pC), (Data c, pC)};
– subordinately associated flow: PI (Data c) = {(f1, pC), ((Data a, pA), (Data b, pB))}.
Figure 6-1 – An example of data provenance information
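The PI notation above can be illustrated with a small data structure. The following is a minimal sketch only, assuming that each flow is stored per derived data item and that flows are acyclic; the names Flow, PROVENANCE, trace_sources and responsible_parties are hypothetical and are not defined by this document. The entries reproduce the Figure 6-1 example given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical record of one flow: the (f, p) and (d, p) pairs for a data item."""
    processes: tuple  # (process, responsible party) pairs, i.e. (f, p)
    sources: tuple    # (data source, responsible party) pairs, i.e. (d, p)


# Flows from the Figure 6-1 example, keyed by the data item they describe.
PROVENANCE = {
    # Directly associated flow of Data d.
    "Data d": Flow(processes=(("f2", "pC"),), sources=(("Data c", "pC"),)),
    # Subordinately associated flow, i.e. PI (Data c).
    "Data c": Flow(
        processes=(("f1", "pC"),),
        sources=(("Data a", "pA"), ("Data b", "pB")),
    ),
}


def trace_sources(data, provenance):
    """Return the original (data, party) pairs from which `data` was derived.

    Assumes the flows are acyclic; a source with no recorded flow is
    treated as an original data source.
    """
    flow = provenance.get(data)
    if flow is None:
        return []
    origins = []
    for source, party in flow.sources:
        if source in provenance:
            # The source is itself derived: follow its subordinate flow.
            origins.extend(trace_sources(source, provenance))
        else:
            origins.append((source, party))
    return origins


def responsible_parties(data, provenance):
    """Collect every responsible party along the flows leading to `data`."""
    flow = provenance.get(data)
    if flow is None:
        return set()
    parties = {p for _, p in flow.processes} | {p for _, p in flow.sources}
    for source, _ in flow.sources:
        parties |= responsible_parties(source, provenance)
    return parties


if __name__ == "__main__":
    print(trace_sources("Data d", PROVENANCE))
    print(sorted(responsible_parties("Data d", PROVENANCE)))
```

In this sketch, tracing Data d follows the directly associated flow to Data c and then the subordinately associated flow to Data a and Data b, which corresponds to tracking back sources of errors and providing attribution of data sources.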
6.2 Data provenance in big data ecosystem
In a big data environment, the complex data processing and migration that result from big data lifecycle
operations (e.g., data generation, transmission, storage, use and deletion), together with data distribution,
make data provenance difficult to manage. According to the big data ecosystem described in [ITU-T
Y.3600], big data provenance needs to handle:
– huge volumes of non-structured, semi-structured and structured data;
– function descriptions for various types and formats of data;
– data traceability across multiple application domains.
NOTE 1 – An application domain is an area of knowledge or activity applied for one specific economic,
commercial, social or administrative scope [b-ITU-T Y.4100].
NOTE 2 – The transport application domain, health application domain and government application
domain are examples of application domains.
In addition, the big data computing environment poses several challenges for data provenance, such as:
– efficient storage mechanism for provenance data: the size of provenance data can be larger than
the original data, causing storage overhead;
– minimizing provenance collection overhead: in a distributed system environment, it is important to
consider the cost of recording provenance together with the computation cost;