Page 43 - Kaleidoscope Academic Conference Proceedings 2020
On top of the complexity of what it takes to look after data comes the hard part. I have focused on data because it is so crucial to AI/ML. The specific and probably most important issues around how data and AI/ML come together are:

• A significant reduction in the effort that it takes to identify, collect, access, ingest, and process the data – organizational efficiency

• Ensuring that the data is sufficient for the purposes it is applied to and at the same time designed to fulfill the needs of multiple applications – complete and reusable

• Ensuring that the annotation or labeling of the data brings with it a level of common sense that can be practically exploited by ML algorithms and models to eliminate some of the faults that occur in brute-force applications of AI – semantics and reasoning

A major long-term effort on this was started by Tim Berners-Lee [14], who took a stab at building the Semantic Web a bit more than twenty years ago. That effort produced protocols, tools, and results that are still useful today but in many ways did not lead to the robust technology that was expected. There has been progress since then, and one peek at the advancements can be found in the recent Dagstuhl Conferences [15]. Another thrust comes from the ideas developed around Object Oriented Programming and the Object Management Group [https://www.omg.org/], which developed formal methods for dealing with data and database schemas. Looking at the foundational building blocks, in some sense the construct of the Semantic Web is too loose and that from OMG too rigid for achieving high levels of reusability for the data and high efficiency in collecting and delivering data across a distributed enterprise.

What I would like to describe is an approach that builds on top of what has been created but with an added twist that fits the needs of the data used in the enterprise space for improving industrial processes and specifically enables a great deal of automation in the preparation of ML models and algorithms. It starts with the “Data Point” protocol. This is similar to the way we use URLs to designate the location of specific resources and information at an atomic level. In this case the protocol resolves and captures a four-dimensional space that consists of resolution for:

All the Identities that the “Data” may have and where the identities are located. As an example, the identity could be the first, middle, and last name of a person, it could be the person’s passport or driver’s license number, or it could be a biometric fingerprint for the person. If the person has a nickname or multiple nicknames used in different settings, those could also be included.

The location where the Meanings of each identity element are located and a description of what that identity signifies, where the meanings are common descriptors and can be found in an ontology so that they can be reusable elemental building blocks. An example from the identity above would be the definition of what is a “passport number” or a “driver’s license number”.

The location of the identity Attributes and machinery for determining a limited set of constraints. The thought is to associate these with an aspect such as time or space, or other attributes that indicate context and validity. In the example above that could be the dates of issue and expiration for the passport or driver’s license. It could also be the jurisdiction in which the passport or driver’s license was issued or any similar information.

The location of the Sources, which is where the data, its meaning, and its attributes come from. In the example we have used it could be where a scan of the passport or driver’s license is stored, or it could be the online access to the database that contains the driver’s license information, or any other information that attests to where the specific identity element comes from.

Rather than depend on a central repository, the idea is to have individuals or organizational units within the enterprise be responsible for labeling and publishing their “Data Points”. That solves four problems. The first is to leave them with the ownership of the data and responsibility for its curation, overcoming the natural and sociological barriers to sharing data. The second is the principle of keeping the data closest to the point of origin, where the organization has its depth of expertise and knowledge and best intuition about the data. The third is that publishing the “Data Point” makes it discoverable and useful for applications above the atomic level. The fourth is scalability, which comes from the ability to build more complex “Data Points” by reusing the atomic “Data Point” population.

Figure 8 - The Data Point Protocol - the resolution of Identity, Meaning, Attributes, and Sources

To make the “Data Point” protocol useful requires a stack of higher level services. Without going into an exhaustive list, there are a number of core services that matter. The contents of each “Data Point” may contain sensitive, proprietary, or regulated information. To deal with the protection of such data, one such service is a Controller that administers the bundle of functions around access control, security, privacy, trust, and assurance. This includes the building blocks for