Page 35 - Proceedings of the 2018 ITU Kaleidoscope

Machine learning for a 5G future




$$ WCSS = \sum_{k=1}^{K} \sum_{j=1}^{p} n_k \cdot s_{jk}^{2} $$

where $n_k$ is the size of cluster k, and $s_{jk}^{2}$ is the estimated
variance of variable j within that cluster.

3.3 Estimation of K

At the beginning of this section, we assumed that we already knew the value
of K. It is possible (as is the case in this work) that the value of K is
unknown. The WCSS reaches its minimum when there are as many clusters as
observations; therefore, searching for its absolute minimum is not an
appropriate criterion for determining the value of K. Nevertheless, there is
an empirical approach, the elbow rule [10], which uses this statistic to
determine K. It consists of plotting the values of WCSS against the values of
K from which we obtained them. The decision is based on where the curve shows
an elbow or inflection point.

There is another exploratory method, called stable classes [11], which
consists of running the k-means algorithm on the same dataset several times,
starting from different random centers each time, to identify stable
groupings. It has the advantage of identifying high-density areas in the
p-dimensional space. On the other hand, it may lead to a high number of
classes. Therefore, the groups we should keep are those corresponding to the
largest stable classes.

If the variables follow a Normal distribution, we could also obtain an
approximate F test to assess the decrease in the within-cluster variance when
K increases by one unit:

$$ F = \frac{WCSS(K) - WCSS(K+1)}{WCSS(K+1) / (n - K - 1)} $$

where n refers to the total sample size. To know whether the variance
decreases significantly, we could compare the statistic to an F distribution
with p and p(n − K − 1) degrees of freedom. There is also an empirical rule
that says that we should keep K + 1 clusters if the F value is greater than
10 [12].

4.  CHARACTERIZING SIGNAL INGRESS

In this section, we present the results obtained in the search for K, as well
as the clusters that we found and how we interpreted the groups. Although
there are guidelines to estimate this parameter, the methods are exploratory
and the decision ultimately depends on the analyst's expertise. Here we show
the best results that we have obtained so far.

We already stated that we have 4,458 measurements on each cable modem, and we
interpret every one of them as a new dimension of analysis. Therefore, the
p-dimensional vector $x_i$ contains all of the measurements made on the i-th
cable modem, so $x_i = \{x_{i,1}; x_{i,2}; \ldots; x_{i,4458}\}$ and
p = 4,458.

After a few attempts to work with the original data, we decided to
standardize each cable modem's values; that is, we subtract the cable modem's
mean from each signal strength value and divide the result by the cable
modem's standard deviation. We found that, without standardization, the
algorithm would capture the fact that some cable modems receive stronger
signals, which may be useful for solving other kinds of problems but not for
reporting on signal impairment.

4.1 The search for the value of K

To apply the elbow rule, Figure 1 shows the values of the WCSS at different
values of K. There appears to be an elbow at K=5. After that, the WCSS
decreases only marginally, and there is a second elbow at K=10.

Figure 1 - Plot of K vs. WCSS

To apply the stable classes method, the k-means algorithm was executed with
two different random starts and K=10.

Table 2 - Distribution among K=10 clusters after two executions, as
percentage of total modem count.

          C1      C2      C3      C4      C5
C1       5.94    7.13    0       6.89    0
C2      15.44    0       0.48    1.43    0
C3       1.66    0       9.03    0       0
C4       0      11.88    0       0       0
C5       5.70    2.61    0       0.24    0
C6       0       0       0       0       5.70
C7       0       0       0.48    0       0
C8       0       0       0       0       2.38
C9       0       0       0.24    0       0
C10      0       0.71    0       0       0
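
A cross-tabulation of the kind reported in Table 2 could be built along the following lines. This is only a minimal sketch under stated assumptions, not the code used in this work: the function name crosstab_percent and the toy label lists are hypothetical, and the labels are assumed to be integers from 0 to K − 1, one per cable modem, produced by two k-means executions on the same data.

```python
from collections import Counter


def crosstab_percent(labels_a, labels_b, k):
    """Return a k x k table: cell [a][b] is the percentage of all modems
    assigned to cluster a in the first run and cluster b in the second."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same set of modems")
    total = len(labels_a)
    counts = Counter(zip(labels_a, labels_b))
    return [[100.0 * counts[(a, b)] / total for b in range(k)]
            for a in range(k)]


# Hypothetical toy labels for 8 modems and K = 3 (not the paper's data).
run_a = [0, 0, 0, 1, 1, 2, 2, 2]
run_b = [1, 1, 1, 0, 2, 2, 2, 2]
table = crosstab_percent(run_a, run_b, k=3)
for row in table:
    print("  ".join(f"{cell:6.2f}" for cell in row))
```

Under this reading, a cell with a large percentage corresponds to a group of modems that stayed together in both executions, which is a candidate stable class; cells with small percentages correspond to modems whose assignment depends on the random start.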

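
The per-modem standardization, the WCSS statistic, the elbow curve and the approximate F statistic of Section 3.3 can be sketched together as below. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: all function names and the toy readings are hypothetical, the k-means routine is a plain Lloyd iteration written with the Python standard library only, and the population variance is assumed, so that n_k times the variance equals the sum of squared deviations from the cluster mean.

```python
import random
from statistics import mean, pstdev


def standardize(row):
    """Standardize one modem's values: (value - mean) / standard deviation."""
    m, s = mean(row), pstdev(row)
    if s == 0:
        return [0.0] * len(row)  # assumption: a constant row maps to zeros
    return [(v - m) / s for v in row]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [mean(col) for col in zip(*points)]


def kmeans(data, k, seed=0, max_iter=100):
    """Lloyd iterations starting from k randomly chosen observations."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = centroid(members)
    return labels


def wcss(data, labels):
    """Sum over clusters k and variables j of n_k * s_jk^2, accumulated as
    squared deviations from each cluster's mean."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(data, labels) if lab == c]
        center = centroid(members)
        total += sum(sq_dist(p, center) for p in members)
    return total


def f_statistic(wcss_k, wcss_k1, n, k):
    """Approximate F statistic for going from k to k + 1 clusters."""
    return (wcss_k - wcss_k1) / (wcss_k1 / (n - k - 1))


# Hypothetical toy readings: 8 "modems" with 4 measurements each.
raw = [
    [1, 2, 3, 5], [11, 12, 13, 16], [2, 4, 7, 8],
    [5, 3, 2, 1], [15, 13, 12, 10], [8, 7, 4, 2],
    [1, 5, 5, 1], [10, 14, 15, 10],
]
data = [standardize(row) for row in raw]
curve = {k: wcss(data, kmeans(data, k)) for k in range(1, 5)}
for k in range(1, 4):
    f = f_statistic(curve[k], curve[k + 1], n=len(data), k=k)
    print(f"K={k}: WCSS={curve[k]:.3f}, F for moving to K+1={f:.2f}")
```

Plotting the values in curve against K gives the kind of curve inspected by the elbow rule. Because each row is standardized on its own, modems with the same signal shape but different overall signal strength end up close together, which is the effect the standardization step is meant to achieve.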



