Page 188 - Proceedings of the 2017 ITU Kaleidoscope
2017 ITU Kaleidoscope Academic Conference
2.2. Latent Dirichlet Allocation (LDA)

Topic modeling is a document analysis model that predicts the structure of a document by expressing the document as a stochastic mixture of topics and each topic as a distribution over words. In other words, it is a statistical inference model for determining hidden topics in documents. One of the most popular topic modeling techniques is the LDA algorithm. The algorithm is attracting attention as a new paradigm of semantic expression that overcomes the disadvantages of Probabilistic Latent Semantic Indexing (PLSI)-based topic representation techniques [13].

As previously mentioned, the LDA algorithm is a generative model that finds hidden topics in a document. The generative model describes the process by which an actual document is created, modeling which subjects each document contains. It therefore infers hidden variables, such as document structure, from observed variables, such as words. As a result, we can understand the subjects of an entire document set from its topics. This relationship can be expressed as a probabilistic graphical model as per Figure 1 [14].

The N plate denotes the collection of words within a document and the D plate the collection of documents. The K plate denotes the collection of topics (clusters). Each node is a random variable and is labeled according to its role in the generative process. The LDA algorithm can infer hidden variables, such as the topic proportions (θ), the per-word topic assignments (Z), and the topics (β), from observed variables such as the words in each document. These variables are estimated under the Dirichlet parameters (α, η). The Dirichlet parameters act as prior pseudo-counts: α for the topics within each document and η for the words within each topic.

Fig. 1. Probabilistic Graph Model of LDA

3. INTERNATIONAL STANDARDS DOCUMENTS

ITU was established in May 1865 and is now a specialized agency of the United Nations. It comprises three sectors: the radio-communication sector, which deals with issues such as radio regulations and frequency allocation; the telecommunication sector, which deals with issues of telecommunication technology and operation; and the development sector, which deals with policy and technical support for network modernization.

ITU-T is the ITU sector responsible for standardization of telecommunication technology. In this study, we collected a total of 252 ITU-T Recommendations of the Y series from the ITU official site to analyze international standard trends [15]. The data is in PDF format and is categorized by series, of which there are 23 in total. Each series is in turn divided into international standard documents by subject.

An ITU-T international standard document consists of a cover, a summary, and contents [16]. First, the cover carries a representative title. Second, the summary summarizes the document. Finally, the contents contain descriptions such as scope, references, definitions, abbreviations, conventions, content, annex, and bibliography as per Table 1. Many individuals, such as researchers and developers, develop technologies based on the standards written in each international standard document. Since each international standard document is based on a title, it becomes important data for topic modeling. Therefore, the entire text of each standard document is the main data for trend analysis.

Table 1. Example of structure of Recommendation ITU-T Y.3501

Number      Table of Contents                                                       Page
1           Scope                                                                   1
2           References                                                              1
3           Definitions                                                             1
4           Abbreviations and acronyms                                              2
5           Conventions                                                             3
6           Overview of big data                                                    3
7           Cloud computing based big data                                          6
8           Requirements of cloud computing based big data                          11
9           Cloud computing based big data capabilities                             14
10          Security considerations                                                 16
Appendix 1  Use cases of cloud computing in support of big data                     17
Appendix 2  Use cases of cloud computing based big data as analysis services        26
Appendix 3  Mapping of big data ecosystem roles into user view of ITU-T Y.3502      29
            Bibliography                                                            30

4. TREND ANALYSIS SYSTEM FOR INTERNATIONAL STANDARDS (TASIS)

4.1. System Architecture of Trend Analysis System for International Standards (TASIS)

We have designed a system architecture for TASIS using text mining as per Figure 2. First, we collected ITU-T Recommendation documents and created a document set. Second, we preprocessed the document set for analysis by converting it from PDF to TXT, deleting unnecessary data, and unifying the data format, using the Java language. In other words, the preprocessing is designed to integrate keywords by deleting symbols and specific characters and by applying techniques such as lemmatization. The database consists of two tables. The document table is constructed to show the results of the LDA algorithm. This table is used as a database for the topic modeling function. The topic table is constructed to show the representative topic of each document. This table is used as a database for trend analysis and the document find function.
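The preprocessing step described above (deleting symbols and specific characters, unifying the format, and lemmatizing) can be illustrated with a minimal sketch. It is not the authors' implementation: the paper states that Java was used, whereas this sketch uses Python for brevity. The function name `preprocess` and the tiny `LEMMAS` lookup table are hypothetical stand-ins; a real system would use a full lemmatizer and would start from the text extracted from the PDF files.

```python
import re
from typing import List

# Hypothetical, deliberately tiny lemma table used only for illustration.
# A real pipeline would replace this with a proper lemmatizer.
LEMMAS = {
    "requirements": "requirement",
    "capabilities": "capability",
    "services": "service",
    "networks": "network",
}


def preprocess(text: str) -> List[str]:
    """Unify case, delete symbols and digits, and map tokens to lemmas."""
    text = text.lower()                    # unify the format (case folding)
    text = re.sub(r"[^a-z\s]", " ", text)  # replace symbols, digits, punctuation with spaces
    tokens = text.split()                  # split on whitespace
    return [LEMMAS.get(tok, tok) for tok in tokens]
```

Calling `preprocess` on the text of each converted document yields one token list per document, which is the form of input a topic model such as LDA typically expects. Replacing symbols with spaces rather than deleting them outright keeps hyphenated terms from being fused into a single token.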
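The two-table database described in Section 4.1 might be laid out as in the following sketch. The paper does not give the schema, so every column name here (`doc_id`, `topic_id`, `proportion`) is an assumption, as is the use of SQLite. The sketch also assumes that the "representative topic" of a document is the topic with the highest proportion in the LDA result, which the text does not state explicitly.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the two tables (hypothetical column names)."""
    conn.executescript(
        """
        -- document table: LDA result, one row per (document, topic) pair
        CREATE TABLE document (
            doc_id     TEXT    NOT NULL,
            topic_id   INTEGER NOT NULL,
            proportion REAL    NOT NULL,
            PRIMARY KEY (doc_id, topic_id)
        );
        -- topic table: one representative topic per document
        CREATE TABLE topic (
            doc_id   TEXT PRIMARY KEY,
            topic_id INTEGER NOT NULL
        );
        """
    )


def fill_topic_table(conn: sqlite3.Connection) -> None:
    """Assumption: the representative topic is the highest-proportion topic."""
    conn.execute(
        """
        INSERT OR REPLACE INTO topic (doc_id, topic_id)
        SELECT d.doc_id, d.topic_id
        FROM document AS d
        WHERE d.proportion = (
            SELECT MAX(d2.proportion)
            FROM document AS d2
            WHERE d2.doc_id = d.doc_id
        )
        """
    )
```

Under these assumptions, the document table would serve the topic modeling function directly, while the derived topic table gives a single lookup per document for trend analysis and document finding. If two topics tie for the highest proportion, `INSERT OR REPLACE` keeps only one of them.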