Page 189 - Proceedings of the 2017 ITU Kaleidoscope

Challenges for a data-driven society

                           Fig. 2. System Architecture of Trend Analysis System for International Standards
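
As a rough illustration of the user-facing inputs to this architecture, the sketch below models a user configuration (a series, a year range, and the number of clusters and iterations for LDA) and uses it to select the documents to analyze. It is not taken from TASIS itself: the names `AnalysisConfig` and `select_documents`, the default values, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisConfig:
    """Hypothetical container for the options a user sets before analysis."""
    series: str               # series of the standard documents, e.g. "Y"
    year_from: int            # first publication year to include
    year_to: int              # last publication year to include (inclusive)
    n_clusters: int = 10      # number of LDA clusters; default is an assumption
    n_iterations: int = 100   # number of LDA iterations; default is an assumption

    def __post_init__(self):
        # Reject configurations that cannot describe a valid analysis run.
        if self.year_from > self.year_to:
            raise ValueError("year_from must not exceed year_to")
        if self.n_clusters < 1 or self.n_iterations < 1:
            raise ValueError("n_clusters and n_iterations must be positive")


def select_documents(documents, config):
    """Return the documents matching the configured series and year range."""
    return [
        doc for doc in documents
        if doc["series"] == config.series
        and config.year_from <= doc["year"] <= config.year_to
    ]


# Made-up records for illustration only; they are not real ITU-T documents.
SAMPLE_DOCUMENTS = [
    {"series_id": "doc-1", "series": "Y", "year": 2005},
    {"series_id": "doc-2", "series": "Y", "year": 2010},
    {"series_id": "doc-3", "series": "Y", "year": 2015},
    {"series_id": "doc-4", "series": "X", "year": 2012},
]

if __name__ == "__main__":
    config = AnalysisConfig(series="Y", year_from=2010, year_to=2015)
    for doc in select_documents(SAMPLE_DOCUMENTS, config):
        print(doc["series_id"], doc["year"])
```

Keeping the selection criteria and the LDA settings in one immutable object is only one possible design; it makes it easy to pass the same configuration to both the document filter and the topic modeling step.
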
TASIS has one configuration step and two functions. The configuration step is user configuration, which enables the web service by setting the series and year range of the international standard documents that the user wants to analyze. Additionally, the number of clusters and the number of iterations can be set when applying the LDA algorithm for topic modeling. The Topic Modeling Function returns the topic modeling results obtained with the LDA algorithm and is designed so that users can easily recognize the size of each topic. The other function is the Trend Analysis and Document Find Function, which presents a trend graph for the keywords in the user-selected topic and then provides a list of relevant international standard documents for a keyword.

4.2. Database Construction

The entire collection of international standard documents was downloaded from the ITU web page. We collected 252 documents for the Y series (global information infrastructure, internet protocol aspects, and next-generation networks) of the ITU-T international standards [15]. As previously mentioned, the collected PDF files were converted to text for analysis. We removed stop words and symbols from the documents and then applied lemmatization to establish the correspondence between keywords.

We designed the following two tables in our database schema so that TASIS can implement the corresponding functions according to their purposes. The document table, whose key value is the number of the document (Series ID), is constructed from the actual data of each document. As shown in Figure 3, the document table is composed of the title of the document (Title); the series of the document (Series), such as Y; the year of publication (Year); the content (Content); and the URL link to download the document (Link). After applying the LDA algorithm to extract the representative topics from the document table, we constructed the topic table, whose key value is also the number of the document (Series ID) and which holds the representative topics used for topic modeling. This table is composed of the representative topic (Topic) extracted by applying the LDA algorithm, the Dirichlet parameter (Dirichlet Parameter), and the occurrence rate normalized by the Dirichlet parameter (Occurrence Rate).

Fig. 3. Database Schema

In particular, the value of the Dirichlet parameter in the topic table is the word count of the representative topic in each document [6, 14], extracted by running the LDA algorithm with both the number of clusters and the number of iterations set to one. The LDA algorithm is based on the expectation-maximization algorithm, an unsupervised learning algorithm [17]. Therefore, if the number of clusters is one, every word is assigned to the same cluster, and the word count value is extracted regardless of the number of iterations.

Since the number of words in each document is different, we need to perform a normalization that allows all documents to be viewed equally. The total sum of the Dirichlet parameters for each document is divided by the Dirichlet


