continuously interacting with a training environment and updating the NN parameters as a result of these interactions. The training environment considered here is a network simulator that mimics the behavior of the real network when varying the offered load of the different slices in the different cells and when modifying the resource usage quota allocated to each slice as a result of the actions made by the DQN agents. In this respect, the simulator is fed by training data consisting of multiple time patterns of the required capacity (i.e., offered load) of the slices in the different cells. This data can be either built synthetically or extracted from real network measurements. The training is assumed to be executed in a training host, located at the SMO, with the necessary libraries, supporting tools and computational capabilities for training the DQN models and running the simulator.
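For illustration purposes, such synthetic training inputs could be generated along the following lines; the sinusoidal daily profile, the parameter names and the default numbers of slices, cells and time steps are assumptions made only for this sketch and are not taken from the present work.

```python
import numpy as np

def synthetic_offered_load(num_slices=2, num_cells=4, num_steps=288,
                           peak_mbps=100.0, seed=0):
    """Build illustrative offered-load time patterns O(k, n) per slice and cell.

    Returns an array of shape (num_steps, num_slices, num_cells) in Mb/s.
    A daily sinusoidal profile plus random noise is assumed here purely as an
    example; real deployments would instead use measured traffic traces.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(num_steps)
    # One "day" of num_steps time steps with a sinusoidal busy-hour pattern.
    daily_profile = 0.5 + 0.5 * np.sin(2 * np.pi * t / num_steps - np.pi / 2)
    load = np.empty((num_steps, num_slices, num_cells))
    for k in range(num_slices):
        for n in range(num_cells):
            scale = peak_mbps * rng.uniform(0.5, 1.0)   # per-slice/cell peak
            noise = rng.normal(0.0, 0.05 * peak_mbps, size=num_steps)
            load[:, k, n] = np.clip(scale * daily_profile + noise, 0.0, None)
    return load
```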
For carrying out the training process, each DQN agent is composed of three different elements: (i) The evaluation NN, which corresponds to the function Qk(s(k), a(k), θk) being learnt and will eventually determine the policy to be applied at the ML inference host. (ii) The target NN, which is another NN with the same structure as the evaluation NN but with weights θk⁻. It is used for obtaining the so-called Temporal Difference (TD) target required for updating the evaluation NN. (iii) The experience data set (ED), which stores the experiences of the agent resulting from the interactions with the training environment, as explained in the following.
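As an illustration of how these three elements could be organized, the following sketch outlines a per-slice agent with an evaluation NN, a target NN of identical structure and an experience data set; it assumes a PyTorch-style implementation, and the class name, layer sizes, discount factor and learning rate are illustrative choices rather than the configuration used in this work. The TD target is computed with the target NN, as indicated in element (ii).

```python
from collections import deque

import torch
import torch.nn as nn

class SliceDQNAgent:
    """Illustrative per-slice DQN agent with the three elements (i)-(iii):
    evaluation NN, target NN and experience data set (ED)."""

    def __init__(self, state_dim, num_actions, ed_capacity=100_000, lr=1e-3):
        # (i) Evaluation NN approximating Qk(s(k), a(k), θk).
        self.eval_nn = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )
        # (ii) Target NN: same structure, separate weights θk⁻,
        #      initialized as a copy of the evaluation NN.
        self.target_nn = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )
        self.target_nn.load_state_dict(self.eval_nn.state_dict())
        # (iii) Experience data set (ED) storing the interaction tuples.
        self.ed = deque(maxlen=ed_capacity)
        self.optimizer = torch.optim.Adam(self.eval_nn.parameters(), lr=lr)

    def td_target(self, reward, next_state, gamma=0.95):
        """TD target computed with the target NN, used to update the evaluation NN."""
        with torch.no_grad():
            return reward + gamma * self.target_nn(next_state).max(dim=-1).values
```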
The interactions between the DQN agent and the training environment occur in time steps of (simulated time) duration Δt. In each time step the DQN agent of the k-th slice observes the state s(k) in the training environment and selects an action a(k). Action selection is based on an ε-greedy policy that, with probability 1−ε, chooses the action that maximizes the output of the evaluation NN and, with probability ε, chooses a random action.
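A minimal sketch of this ε-greedy selection, assuming the agent structure of the previous snippet, is given below; it is not the authors' implementation.

```python
import random

import torch

def select_action(agent, state, epsilon, num_actions):
    """ε-greedy selection of a(k) for one time step."""
    if random.random() < epsilon:
        # With probability ε: explore with a random action.
        return random.randrange(num_actions)
    # With probability 1−ε: exploit the action maximizing the evaluation NN output.
    with torch.no_grad():
        q_values = agent.eval_nn(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())
```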
As a result of applying the selected action, the training environment generates a reward value r(k) that assesses how good the action was from the perspective of the desired behavior. In particular, in the considered approach the reward captures both the SLA satisfaction and the capacity utilization. In this way, the reward for slice k is defined as the weighted product of three terms given by:
r(k) = γSLA(k)^φ1 · ( (1/(K−1)) ∑_{k'=1, k'≠k}^{K} γSLA(k') )^φ2 · γu(k)^φ3        (1)

where φ1, φ2 and φ3 are the weights of each component.

The first and second components in (1) correspond, respectively, to the SLA satisfaction ratio γSLA(k) of the slice k and the aggregate for the rest of slices k'≠k. Specifically, γSLA(k) is the ratio between the aggregate throughput obtained by the slice across all cells, T(k), and the minimum between the aggregate offered load A(k) and the dlThptPerSlice(k) term of the SLA, and is computed as:

γSLA(k) = min( T(k) / min(dlThptPerSlice(k), A(k)), 1 )        (2)

where A(k) is the aggregate across all the cells of the per-cell offered load O(k,n) of slice k, bounded by the limit established by the TermDensity(k) and dlThptPerUe(k) parameters of the SLA in the service area S(n) of each cell n, that is:

A(k) = ∑_{n=1}^{N} min( O(k,n), dlThptPerUe(k) · TermDensity(k) · S(n) )        (3)

The third component of the reward is the capacity utilization factor, γu(k), which aims at minimizing the over-provisioning of capacity and is defined as the ratio between the aggregate throughput T(k) obtained by the slice and the total capacity allocated to the slice across all cells, that is:

γu(k) = T(k) / ∑_{n=1}^{N} C(n) · σ(k,n)        (4)

where C(n) is the capacity of cell n and σ(k,n) is the resource usage quota allocated to slice k in cell n.
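Putting (1)-(4) together, the per-slice reward could be evaluated as sketched below; the array shapes, the default weights φ1 = φ2 = φ3 = 1 and the quota notation σ(k,n) follow the reconstruction above and are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def slice_reward(k, T, offered_load, dl_thpt_per_slice, dl_thpt_per_ue,
                 term_density, cell_area, cell_capacity, quota,
                 phi1=1.0, phi2=1.0, phi3=1.0):
    """Evaluate the reward r(k) of (1) from the quantities in (2)-(4).

    T:             aggregate throughput T(k') per slice, shape (K,).
    offered_load:  per-cell offered load O(k', n), shape (K, N).
    cell_capacity: C(n) per cell, shape (N,).
    quota:         capacity quota sigma(k', n) per slice and cell, shape (K, N).
    The remaining SLA parameters are per-slice arrays of shape (K,).
    Assumes non-zero offered loads and allocations (no division-by-zero guard).
    """
    K, _ = offered_load.shape

    def gamma_sla(j):
        # (3): per-cell offered load bounded by the terminal-density limit.
        bound = dl_thpt_per_ue[j] * term_density[j] * cell_area
        A_j = np.sum(np.minimum(offered_load[j], bound))
        # (2): SLA satisfaction ratio, capped at 1.
        return min(T[j] / min(dl_thpt_per_slice[j], A_j), 1.0)

    own = gamma_sla(k)
    others = sum(gamma_sla(j) for j in range(K) if j != k) / (K - 1)
    gamma_u = T[k] / np.sum(cell_capacity * quota[k])      # (4)
    return (own ** phi1) * (others ** phi2) * (gamma_u ** phi3)   # (1)
```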
The reward r(k) is provided by the training environment to the DQN agent at the end of each time step and, correspondingly, the T(k) and A(k) values correspond to average values during the time step.

As a result of the interactions between the training environment and the DQN agent, each experience of the ED is represented by a tuple that includes the state observed at the beginning of a given time step, the selected action, the obtained reward as a result of this action and the new state observed at the end of the time step duration.
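Each such experience maps naturally onto a small record; the field names below are illustrative, assuming the agent and ED structure sketched earlier.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One entry of the experience data set (ED): (s, a, r, s')."""
    state: tuple        # s(k) observed at the beginning of the time step
    action: int         # a(k) selected by the ε-greedy policy
    reward: float       # r(k) returned by the training environment
    next_state: tuple   # state observed at the end of the time step

def store_experience(agent, state, action, reward, next_state):
    """Append the experience of the current time step to the ED."""
    agent.ed.append(Experience(state, action, reward, next_state))
```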
                                                               NN k according to the mini-batch gradient descent
