Page 113 - Big data - Concept and application for telecommunications
Figure 7-1 shows the use of data provenance in a big data ecosystem:
– when data is imported from an outside data source (data provider (DP):data supplier (DS)) and
stored, BDSP A (big data system A) generates metadata based on the importing context (e.g.,
responsible party information, time, size), and these metadata entities are used as provenance
information;
– BDSP A monitors the process of data mash-up or analysis and stores it as provenance
information to ensure the reliability of data quality and the reproducibility of analysis results;
– when BDSP A exports data to BDSP B or registers a data catalogue with a data registry in a data
market (DP:data broker (DB)), BDSP A delivers the provenance information.
NOTE – When a BDSP exports or registers data with its provenance information, the BDSP manages the
level of detail by simplifying the provenance information according to its own data or service
policy.
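The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the specification: the function names (capture_import_metadata, record_step, simplify_for_export), the record keys, and the representation of an export policy as a set of permitted keys are all hypothetical.

```python
from datetime import datetime, timezone


def capture_import_metadata(payload: bytes, responsible_party: str) -> dict:
    """Build a provenance record from the importing context (hypothetical keys)."""
    return {
        "responsible_party": responsible_party,
        "imported_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(payload),
        "process_steps": [],  # filled in while mash-up or analysis runs
    }


def record_step(provenance: dict, function_name: str) -> None:
    """Append one processing step so the analysis can be reproduced later."""
    provenance["process_steps"].append(function_name)


def simplify_for_export(provenance: dict, allowed_keys: set) -> dict:
    """Keep only the fields the exporter's policy permits (assumed policy form)."""
    return {k: v for k, v in provenance.items() if k in allowed_keys}


# Import, process, then export with a reduced level of detail.
record = capture_import_metadata(b"raw,data\n1,2\n", "DS-example")
record_step(record, "clean")
record_step(record, "aggregate")
exported = simplify_for_export(record, {"responsible_party", "process_steps"})
```

Here the exported record drops the import time and size, which is one possible way a provider's policy could reduce the level of detail before delivery.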
7.2 Conceptual model of big data provenance information
Big data provenance information is an extension of the general data provenance concept, which is described
in clause 6.1.
Figure 7-2 – Conceptual model for big data provenance information
Figure 7-2 shows a high-level conceptual model for big data provenance information. Big data provenance
information (BD_ProvenanceInformation) is an aggregated set of big data provenance units
(BD_ProvenanceUnit) which records a history of the most recent changes to the data.
A big data provenance unit is a minimum set of big data provenance information. It provides information about
data ownership or authority (ResponsibleParty), data processing environment (ComputationalEnvironment),
and a sequence of functions (PI_Function) with input and output data (BD_DataInstance) which are involved
in data mash-up or analysis.
NOTE 1 – A workflow depicts the actual sequence of functions that describes data processing. In the big
data provenance information model, a workflow can be derived from the association +processStep, which
describes the sequence of PI_Function instances.
NOTE 2 – BD_DataInstance is metadata composed of identifiable information (e.g., access information,
type and format of data, data size, date, personally identifiable information (PII)) for a data instance.
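The conceptual model described above might be expressed with Python dataclasses as shown below. This is only a sketch: the class names mirror the model's element names, but the attribute names, the field types, the workflow() helper and the sample values (including the "store://" access strings) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BD_DataInstance:
    """Metadata identifying one data instance (fields are illustrative)."""
    access_info: str
    data_format: str
    size_bytes: int


@dataclass
class PI_Function:
    """One function in the processing sequence, with its input and output data."""
    name: str
    inputs: List[BD_DataInstance]
    outputs: List[BD_DataInstance]


@dataclass
class BD_ProvenanceUnit:
    responsible_party: str          # ResponsibleParty
    computational_environment: str  # ComputationalEnvironment
    process_steps: List[PI_Function] = field(default_factory=list)  # +processStep

    def workflow(self) -> List[str]:
        """Derive the workflow as the ordered names of the process steps."""
        return [step.name for step in self.process_steps]


@dataclass
class BD_ProvenanceInformation:
    """Aggregated set of provenance units."""
    units: List[BD_ProvenanceUnit] = field(default_factory=list)


# Hypothetical sample: two inputs are mashed up into one result.
input_1 = BD_DataInstance("store://input-1", "csv", 1024)
input_2 = BD_DataInstance("store://input-2", "csv", 2048)
result = BD_DataInstance("store://result", "parquet", 512)
unit = BD_ProvenanceUnit(
    responsible_party="BDSP A",
    computational_environment="example cluster",
    process_steps=[PI_Function("mash_up", [input_1, input_2], [result])],
)
info = BD_ProvenanceInformation(units=[unit])
```

Keeping process_steps as an ordered list is one simple way to let the workflow be derived from the +processStep association, as NOTE 1 describes.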
Figure 7-3 illustrates an example of capturing the provenance unit on Data C.