Page 189 - Proceedings of the 2017 ITU Kaleidoscope
Challenges for a data-driven society
Fig. 2. System Architecture of Trend Analysis System for International Standards
TASIS has one configuration step and two functions. The configuration step is user configuration, designed to enable the actual web service by setting the series and year ranges of the international standard documents that the user wants to analyze directly. Additionally, the number of clusters and the number of iterations can be set when applying the LDA algorithm for topic modeling. The Topic Modeling Function is designed to allow users to easily recognize the size of each topic when receiving the topic modeling results after the LDA algorithm has been applied. The other function is the Trend Analysis and Document Find Function, which presents a trend graph for the keywords in the user-selected topic and then provides a list of relevant international standard documents for the keyword.

4.2. Database Construction

The entire collection of international standard documents was downloaded from the ITU web page. We collected 252 documents for the Y series (global information infrastructure, internet protocol aspects, and next-generation networks) of the ITU-T international standard [15]. As previously mentioned, the collected PDF files were converted to text for analysis. We removed stop words and symbols from the documents, and then applied lemmatization to obtain the correspondence between keywords.

We designed the following two tables in our database schema, so that TASIS implements the corresponding functions according to their purposes. The document table, whose key value is the number of the document (Series ID), is constructed from the actual data of each document. As shown in Figure 3, the document table is composed of the title of the document (Title); series of the document (Series), such as Y; year of publication (Year); content (Content); and the URL link to download the document (Link). After applying the LDA algorithm to extract a representative topic from the document table, we constructed the topic table, whose key value is also the number of the document (Series ID) and which is constructed from the representative topics for topic modeling. This table is composed of the representative topic (Topic) extracted by applying the LDA algorithm, the Dirichlet parameter (Dirichlet Parameter), and the occurrence rate normalized by the Dirichlet parameter (Occurrence Rate).

Fig. 3. Database Schema

In particular, the value of the Dirichlet parameter in the topic table is the word count value for the representative topic of each document [6, 14], which was extracted by running the LDA algorithm with both the number of clusters and the number of iterations set to one. The LDA algorithm is based on the expectation-maximization algorithm, which is an unsupervised learning algorithm [17]. Therefore, if the number of clusters is one, the word count value is extracted regardless of the number of iterations.

Since the number of words in each document is different, we need to perform a normalization that allows all documents to be viewed equally. The total sum of the Dirichlet parameters for each document is divided by the Dirichlet