Page 42 - Kaleidoscope Academic Conference Proceedings 2020

4.2    Technologies in Focus – Data Science and Data Engineering

"Information is the oil of the 21st century, and analytics is the combustion engine." This is a quote from Peter Sondergaard, senior vice president and global head of Research at Gartner, Inc. The quote may be used in discussing the importance of data and data analytics. It came from a speech given by Mr. Sondergaard at the Gartner Symposium/ITxpo in October 2011 in Orlando, Florida, an early but probably not the first expression of the sentiment. This is a frequently repeated metaphor and has appeared in newspaper articles and on the cover of multiple prestigious magazines. An example is the Economist report published May 6, 2017, titled “The world’s most valuable resource is no longer oil, but data”.

Figure 6 - Taken from the cover of the Economist May 6, 2017.

Data is so important for AI and ML that it is worthwhile to look at what is happening in the data world. Without a reference (because I have no way of verifying if it’s true or not, but it feels right), I have repeatedly heard two stunning numbers. The first is the number of people around the world employed in data gathering and processing; the estimate is about 25 million on the low end and 54 million on the high end. The second is the fraction of data actually used and the fraction analyzed; the usage number is 2% to 4% and the number analyzed is about 0.1%. That means that we are badly misusing the world’s greatest resource! New data is being created at a pace of more than 2.5 quintillion bytes a day. According to the Information Overload Research Group, over 90% of all data now stored has been created within the last two years [https://iorgforum.org/].

There are two challenges here: (1) making it easier to create the “high quality” data that is so essential for AI/ML; and (2) lowering the frictional cost of curating and safely delivering data so it can be purposefully consumed and shared among as many applications and organizations as possible. Each of these two challenges has a technical and a sociological/organizational aspect.

Historically there have been many attempts to deal with both of these challenges. To be explicit, the technical aspects include the capability to deal with heterogeneous data types, to label the data automatically, to reuse the data for multiple purposes, to verify the quality and origin of the data, and to efficiently deliver the data in a distributed environment. The organizational/sociological aspects include data governance and overcoming the reluctance of enterprise units to share data across organizational boundaries – avoiding stovepipes!

From an enterprise perspective, the collection and storage of data includes isolated pools, defined by physical location, organizational structure, or quite frequently by domain-specific needs within the enterprise. Such data is hard to discover and difficult to share, even though its general availability could bring significant value. On the positive side, data is usually born or created close to the source. The personnel involved in creating or collecting the data have the greatest intuition for what the data means and are usually direct consumers of the data. They are, however, not necessarily the ones who perform large-scale analysis. This may account for part of the asymmetry between the data usage and data analysis numbers reported earlier. There is a middle ground where data is designed to be shared between departments within an enterprise. Somewhat overgeneralizing, it can be characterized as limited in scope, incomplete in what it addresses, and inflexible in how it is structured. Lastly, there are large centralized enterprise data stores. I would refer to these as monolithic approaches that seek to create a unified repository that represents the “Single Truth” of the enterprise’s status. Inherent to the last approach is the gulf between the keepers of the data and the originators, in intuition, in time, and in distance defined by separation in organizational location. In addition, even though it may be desirable, not all data is in fact exact or reliable, nor is all data free of ambiguity. Additionally, the data is not immutable and can change in time, due to evolving circumstances, and the data’s meaning can be highly dependent on context. Without delving too deeply into details, it is generally recognized that data has a number of “V” characteristics, some of which are listed in Figure 7.

Figure 7 - Data characteristics for enterprise applications

In addition, there is another dimension to data science and data engineering that deals with how the data is to be stored, retrieved, updated, and maintained. In conjunction with the four new infrastructures that were previously described, much of the data will be distributed and the uses of the data will be distributed as well, and increasingly the data may well be associated with mobility.
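
As a rough illustration, the volume and usage figures quoted earlier in this section can be put on a common scale. The short Python sketch below is not part of the original analysis: it assumes, purely for illustration, that the quoted usage and analysis fractions can be applied to the quoted daily creation rate, which the quoted sources do not state. The constant and function names are hypothetical.

```python
# Figures quoted in the text; the author notes they are unverified.
BYTES_PER_DAY = 2.5e18            # "more than 2.5 quintillion bytes a day"
USED_LOW, USED_HIGH = 0.02, 0.04  # "2% to 4%" of data actually used
ANALYZED = 0.001                  # "about 0.1%" of data analyzed
PETABYTE = 1e15                   # decimal petabyte, 10**15 bytes


def daily_petabytes(fraction: float, bytes_per_day: float = BYTES_PER_DAY) -> float:
    """Return the daily volume, in petabytes, for a given fraction of new data.

    Assumption (illustrative only): the fraction applies uniformly to the
    data created each day.
    """
    return bytes_per_day * fraction / PETABYTE


created_pb = daily_petabytes(1.0)
used_pb = (daily_petabytes(USED_LOW), daily_petabytes(USED_HIGH))
analyzed_pb = daily_petabytes(ANALYZED)

print(f"created per day:  {created_pb:,.0f} PB")
print(f"used per day:     {used_pb[0]:,.0f} to {used_pb[1]:,.0f} PB")
print(f"analyzed per day: {analyzed_pb:,.1f} PB")
```

Under these assumptions, about 2.5 exabytes are created each day, of which only about 2.5 petabytes would be analyzed, which is the asymmetry the text describes.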




