
2017 ITU Kaleidoscope Academic Conference




2.2. Latent Dirichlet Allocation (LDA)

Topic modeling is a document analysis technique that predicts the structure of a document by expressing the document as a stochastic mixture of topics and each topic as a distribution over words. In other words, it is a statistical inference model for determining the hidden topics in documents. One of the most popular topic modeling techniques is the LDA algorithm. The algorithm is attracting attention as a new paradigm of semantic expression that overcomes the disadvantages of Probabilistic Latent Semantic Indexing (PLSI)-based topic representation techniques [13].

As previously mentioned, the LDA algorithm is a generative model that finds hidden topics in a document. A generative model describes the process by which a document is assumed to be created: it models which topics each document contains and how the words of the document are drawn from those topics. LDA therefore infers hidden variables, such as document structure, from observed variables, such as words. As a result, we can understand the subject of an entire document set from its topics. This relationship can be expressed as a probabilistic graphical model, as shown in Figure 1 [14].

Fig. 1. Probabilistic Graph Model of LDA

The D plate denotes the collection of documents, the N plate the collection of words within each document, and the K plate the collection of topics (clusters). Each node is a random variable and is labeled according to its role in the generative process. The LDA algorithm can predict hidden variables, such as the topic proportions (θ), the per-word topic assignments (Z), and the topics (β), using observed variables such as the words in the documents. These variables are estimated under the Dirichlet parameters (α, η), which act as prior pseudo-counts: α for the topics within each document and η for the words within each topic.

3. INTERNATIONAL STANDARDS DOCUMENTS

ITU was established in May 1865 and is now a specialized agency of the United Nations. It comprises three sectors: the radiocommunication sector, which deals with issues such as radio regulations and frequency allocation; the telecommunication standardization sector, which deals with issues of telecommunication technology and operation; and the development sector, which deals with policy and technical support for network modernization.

ITU-T is the sector of ITU responsible for standardization of telecommunication technology. In this study, we collected a total of 252 ITU-T Recommendations of the Y series from the official ITU site to analyze international standard trends [15]. The data is in PDF format and is categorized by series, for a total of 23 series. Each series is further categorized into international standard documents related to each subject.

An ITU-T international standard document consists of a cover, a summary, and contents [16]. First, the cover carries a representative title. Second, the summary summarizes the document. Finally, the contents contain sections such as scope, references, definitions, abbreviations, conventions, main content, annexes, and bibliography, as shown in Table 1. Many individuals, such as researchers and developers, develop technologies based on the standards written in each international standard document. Since each international standard document is built around the subject stated in its title, it is important data for topic modeling. Therefore, the entire text of each standard document is the main data for trend analysis.

Table 1. Example of structure of Recommendation ITU-T Y.3501

Number       Table of Contents                                                        Page
1            Scope                                                                    1
2            References                                                               1
3            Definitions                                                              1
4            Abbreviations and acronyms                                               2
5            Conventions                                                              3
6            Overview of big data                                                     3
7            Cloud computing based big data                                           6
8            Requirements of cloud computing based big data                           11
9            Cloud computing based big data capabilities                              14
10           Security considerations                                                  16
Appendix 1   Use cases of cloud computing in support of big data                      17
Appendix 2   Use cases of cloud computing based big data as analysis services         26
Appendix 3   Mapping of big data ecosystem roles into user view of ITU-T Y.3502       29
             Bibliography                                                             30

4. TREND ANALYSIS SYSTEM FOR INTERNATIONAL STANDARDS (TASIS)

4.1. System Architecture of Trend Analysis System for International Standards (TASIS)

We have designed a system architecture for TASIS using text mining, as shown in Figure 2. First, we collected ITU-T Recommendation documents and created a document set. Second, we preprocessed the document set for analysis by converting it from PDF to TXT, deleting unnecessary data, and unifying the data format, using the Java language. In other words, the preprocessing step is designed to consolidate keywords by deleting symbols and specific characters and by applying techniques such as lemmatization. The database consists of two tables. The document table stores the results of the LDA algorithm and serves as the database for the topic modeling function. The topic table stores the representative topic of each document and serves as the database for trend analysis and the document find function.
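
The preprocessing step and the two database tables described in Section 4.1 can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the system described here was implemented in Java, and it is not the TASIS implementation. The lemma lookup `LEMMAS`, the function names, the table field names, and the rule that the representative topic is the one with the largest proportion are all assumptions made for illustration, as are the sample document ID and proportions.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lemma lookup standing in for a real lemmatizer (assumption).
LEMMAS: Dict[str, str] = {
    "networks": "network",
    "services": "service",
    "requirements": "requirement",
}


def preprocess(text: str) -> List[str]:
    """Lowercase the text, delete symbols and digits, and map tokens to lemmas.

    Treating digits as "specific characters" to delete is an assumption.
    """
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [LEMMAS.get(token, token) for token in cleaned.split()]


def representative_topic(proportions: List[float]) -> int:
    """Return the index of the topic with the largest proportion (assumed rule)."""
    return max(range(len(proportions)), key=proportions.__getitem__)


def build_tables(
    doc_topics: Dict[str, List[float]]
) -> Tuple[List[dict], List[dict]]:
    """Build a document table (all LDA proportions) and a topic table
    (one representative topic per document). Field names are hypothetical."""
    document_table = [
        {"doc_id": doc_id, "topic_id": k, "proportion": p}
        for doc_id, props in doc_topics.items()
        for k, p in enumerate(props)
    ]
    topic_table = [
        {"doc_id": doc_id, "representative_topic": representative_topic(props)}
        for doc_id, props in doc_topics.items()
    ]
    return document_table, topic_table


if __name__ == "__main__":
    print(preprocess("Cloud-computing: Requirements (2017) & Services!"))
    # Illustrative topic proportions for one document, not real results.
    docs, topics = build_tables({"Y.3501": [0.1, 0.7, 0.2]})
    print(topics)
```

Keeping the full proportions in one table and only the dominant topic in the other mirrors the split described above: the first supports the topic modeling view, while the second is enough for trend counts and document lookup.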



