Page 43 - Kaleidoscope Academic Conference Proceedings 2020

On top of the complexity of what it takes to look after data comes the hard part. I have focused on data because it is so crucial to AI/ML. The specific and probably most important issues around how data and AI/ML come together are:

    •  Significantly reducing the effort that it takes to identify, collect, access, ingest, and process the data – organizational efficiency

    •  Ensuring that the data is sufficient for the purposes it is applied to and at the same time designed to fulfill the needs of multiple applications – complete and reusable

    •  Ensuring that the annotation or labeling of the data brings with it a level of common sense that can be practically exploited by ML algorithms and models to eliminate some of the faults that occur in brute applications of AI – semantics and reasoning

A major long-term effort on this was started by Tim Berners-Lee [14], who took a stab at building the Semantic Web a bit more than twenty years ago. That effort produced protocols, tools, and results that are still useful today, but in many ways it did not lead to the robust technology that was expected. There has been progress since then, and a glimpse of the advances can be found in the recent Dagstuhl Conferences [15]. Another thrust comes from the ideas developed around Object Oriented Programming and the Object Management Group [https://www.omg.org/], which developed formal methods for dealing with data and database schemas. Looking at the foundational building blocks, in some sense the construct of the Semantic Web is too loose and that from OMG too rigid to achieve high levels of reusability for the data and high efficiency in collecting and delivering data across a distributed enterprise.

What I would like to describe is an approach that builds on top of what has been created, but with an added twist that fits the needs of the data used in the enterprise space for improving industrial processes and specifically enables a great deal of automation in the preparation of ML models and algorithms. It starts with the “Data Point” protocol. This is similar to the way we use URLs to designate the location of specific resources and information at an atomic level. In this case the protocol resolves and captures a four-dimensional space that consists of resolution for:

All the Identities that the “Data” may have and where the identities are located. As an example, the identity could be the first, middle, and last name of a person, it could be the person’s passport or driver’s license number, or it could be a biometric fingerprint for the person. If the person has a nickname or multiple nicknames used in different settings, those could also be included.

The location of the Meanings of each identity element and a description of what that identity signifies. The meanings are common descriptors that can be found in an ontology so that they can serve as reusable elemental building blocks. An example from the identity above would be the definition of what a “passport number” or a “driver’s license number” is.

The location of the identity Attributes and the machinery for determining a limited set of constraints. The thought is to associate these with an aspect such as time or space, or other attributes that indicate context and validity. In the example above that could be the dates of issue and expiration for the passport or driver’s license. It could also be the jurisdiction in which the passport or driver’s license was issued, or any similar information.

The location of the Sources, that is, where the data, its meaning, and its attributes come from. In the example we have used it could be where a scan of the passport or driver’s license is stored, or it could be the online access to the database that contains the driver’s license information, or any other information that attests to where the specific identity element comes from.

Rather than depend on a central repository, the idea is to have individuals or organizational units within the enterprise be responsible for labeling and publishing their “Data Points”. That solves four problems. The first is to leave them with the ownership of the data and responsibility for its curation, overcoming the natural and sociological barriers to sharing data. The second is the principle of keeping the data closest to the point of origin, where the organization has its depth of expertise and knowledge and the best intuition about the data. The third is that publishing the “Data Point” makes it discoverable and useful for applications above the atomic level. The fourth is scalability, which comes from the ability to build more complex “Data Points” by reusing the atomic “Data Point” population.

Figure 8 - The Data Point Protocol - the resolution of Identity, Meaning, Attributes, and Sources

To make the “Data Point” protocol useful requires a stack of higher-level services. Without going into an exhaustive list, there are a number of core services that matter. The contents of each “Data Point” may include sensitive, proprietary, or regulated information. To deal with the protection of such data, one such service is a Controller that administers the bundle of functions around access control, security, privacy, trust, and assurance. This includes the building blocks for
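
To make the four dimensions of the “Data Point” protocol more concrete, the following is a minimal sketch in Python, not the protocol itself. Every name in it (`Attribute`, `DataPoint`, `Registry`, the `owner` and `parts` fields, the example URLs and values) is hypothetical and introduced only for illustration. It assumes one identity element per atomic “Data Point”, represents the locations of meanings and sources as plain URL strings, and uses an in-memory dictionary as a stand-in for decentralized publishing and discovery.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Attribute:
    """Context and validity constraints attached to an identity element."""
    jurisdiction: Optional[str] = None
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, day: date) -> bool:
        # A missing bound is treated as "no constraint" on that side.
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_until is not None and day > self.valid_until:
            return False
        return True


@dataclass
class DataPoint:
    """An atomic or composite Data Point resolved along four dimensions."""
    identity: str          # the identity element, e.g. a passport number
    meaning: str           # location of the ontology term that defines it
    attributes: Attribute  # context and validity (time, jurisdiction)
    source: str            # location that attests to where the value comes from
    owner: str             # unit responsible for labeling and curation
    parts: List["DataPoint"] = field(default_factory=list)

    def is_atomic(self) -> bool:
        # A Data Point with no parts is atomic; otherwise it reuses others.
        return not self.parts


class Registry:
    """In-memory stand-in for publishing and discovering Data Points."""

    def __init__(self) -> None:
        self._by_meaning: Dict[str, List[DataPoint]] = {}

    def publish(self, point: DataPoint) -> None:
        self._by_meaning.setdefault(point.meaning, []).append(point)

    def discover(self, meaning: str) -> List[DataPoint]:
        # Return a copy so callers cannot mutate the registry's own list.
        return list(self._by_meaning.get(meaning, []))


# Hypothetical ontology terms and example values.
PASSPORT_TERM = "https://ontology.example/terms/passport-number"
PERSON_TERM = "https://ontology.example/terms/person"

passport = DataPoint(
    identity="X1234567",
    meaning=PASSPORT_TERM,
    attributes=Attribute("NL", date(2018, 5, 1), date(2028, 4, 30)),
    source="https://records.example/scans/passport/X1234567",
    owner="hr-department",
)

# A composite Data Point built by reusing an atomic one.
person = DataPoint(
    identity="Jane Q. Doe",
    meaning=PERSON_TERM,
    attributes=Attribute(),
    source="https://records.example/people/jane-q-doe",
    owner="hr-department",
    parts=[passport],
)

registry = Registry()
registry.publish(passport)
registry.publish(person)
```

In this sketch the owning unit publishes its own “Data Points”, and an application finds them by the ontology term it cares about rather than by asking a central repository. A real deployment would also need the Controller functions mentioned above, which this sketch leaves out.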




