Page 42 - Kaleidoscope Academic Conference Proceedings 2020
4.2 Technologies in Focus – Data Science and Data Engineering

"Information is the oil of the 21st century, and analytics is the combustion engine." This is a quote from Peter Sondergaard, senior vice president and global head of Research at Gartner, Inc. The quote may be used in discussing the importance of data and data analytics. It came from a speech given by Mr. Sondergaard at the Gartner Symposium/ITxpo in October 2011 in Orlando, Florida, an early but probably not the first expression of the sentiment. This is a frequently repeated metaphor and has appeared in newspaper articles and on the cover of multiple prestigious magazines. An example is The Economist report published May 6, 2017, titled “The world’s most valuable resource is no longer oil, but data”.

Figure 6 - Taken from the cover of The Economist, May 6, 2017.

Data is so important for AI and ML that it is worthwhile to look at what is happening in the data world. Without a reference (because I have no way of verifying if it’s true or not, but it feels right), I have repeatedly heard two stunning numbers. The first is the number of people around the world employed in data gathering and processing; the estimate is about 25 million on the low end and 54 million on the high end. The second is the fraction of data actually used and the fraction analyzed; the usage number is 2% to 4% and the number analyzed is about 0.1%. That means that we are badly misusing the world’s greatest resource! The pace at which new data is being created is more than 2.5 quintillion bytes a day. According to the Information Overload Research Group, over 90% of all data now stored has been created within the last two years [https://iorgforum.org/].

There are two challenges here: (1) making it easier to create the “high quality” data that is so essential for AI/ML; and (2) lowering the frictional cost of curating and safely delivering data so that it can be purposefully consumed and shared among as many applications and organizations as possible. Both challenges have a technical and a sociological/organizational aspect.

Over the years there have been many attempts to deal with both of these challenges. To be explicit, the technical aspects include the capability to deal with heterogeneous data types, to label the data automatically, to reuse the data for multiple purposes, to verify the quality and origin of the data, and to efficiently deliver the data in a distributed environment. The organizational/sociological aspects include data governance, and overcoming the reluctance of enterprise units to share data across organizational boundaries – avoiding stovepipes!

From an enterprise perspective, the collection and storage of data includes isolated pools, defined by physical location, organizational structure, or quite frequently by domain-specific needs within the enterprise. Such data is hard to discover and difficult to share, even though its general availability could bring significant value. On the positive side, data is usually born or created close to the source. The personnel involved in creating or collecting the data have the greatest intuition for what the data means and are usually direct consumers of the data. They are, however, not necessarily the ones who perform large-scale analysis. This may account for part of the asymmetry between the data usage and data analysis numbers reported earlier. There is a middle ground where data is designed to be shared between departments within an enterprise. Somewhat overgeneralizing, it can be characterized as limited in scope, incomplete in what it addresses, and inflexible in how it is structured. Lastly, there are large centralized enterprise data stores. I would refer to these as monolithic approaches that seek to create a unified repository that represents the “Single Truth” of the enterprise’s status. Inherent to the last approach is the gulf between the keepers of the data and the originators, in intuition, in time, and in distance defined by separation in organizational location. In addition, even though it may be desirable, not all data is in fact exact or reliable, nor does all data lack ambiguity. Additionally, the data is not immutable and can change in time, due to evolving circumstances, and the data’s meaning can be highly dependent on context. Without delving too deeply into details, it is generally recognized that data has a number of “V” characteristics, some of which are listed in Figure 7.

Figure 7 - Data characteristics for enterprise applications

In addition, there is another dimension to data science and data engineering that deals with how the data is to be stored, retrieved, updated, and maintained. In conjunction with the four new infrastructures that were previously described, much of the data will be distributed and the uses of the data will be distributed as well, and increasingly the data may well be associated with mobility.
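
As an illustration of two of the technical aspects listed in this section, labeling data at the point of creation and later verifying its quality and origin, the following is a minimal sketch and not a description of any particular system. The names DataRecord, make_record and is_intact, and the sample sensor values, are hypothetical and chosen only for this example. The sketch assumes that the originator attaches a source name, labels, a timestamp and a SHA-256 checksum to each piece of data. The checksum only shows that the payload has not changed since the record was created; it does not prove that the claimed source is genuine.

```python
import hashlib
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataRecord:
    """Hypothetical wrapper pairing a payload with provenance metadata."""
    payload: bytes
    source: str          # claimed origin of the data (not authenticated here)
    labels: tuple = ()   # labels attached when the data is created
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    checksum: str = ""   # SHA-256 hex digest of the payload


def make_record(payload: bytes, source: str, labels=()) -> DataRecord:
    """Create a record close to the source, computing its checksum once."""
    digest = hashlib.sha256(payload).hexdigest()
    return DataRecord(payload=payload, source=source,
                      labels=tuple(labels), checksum=digest)


def is_intact(record: DataRecord) -> bool:
    """Return True if the payload still matches the stored checksum."""
    return hashlib.sha256(record.payload).hexdigest() == record.checksum


# Usage: the originator creates the record, a downstream consumer checks it.
record = make_record(b"temp=21.5", source="sensor-17", labels=["temperature"])
tampered = replace(record, payload=b"temp=99.9")  # payload altered in transit
print(is_intact(record), is_intact(tampered))
```

Keeping the metadata attached to the data itself, rather than in a separate catalogue, is one possible design choice; it lets a consumer in another organizational unit run the check without access to the originator's systems.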