framework). The computation of σ(k,n) and the potential update of the rRMPolicyDedicatedRatio attribute is done every t minutes. The determination of the resource usage quota σ(k,n) is realized through the following functions:

• Per-slice action selection policies

The action selection policy of slice k takes the network state s(k) observed for this slice at the time the policy is executed and determines the action a(k) to be applied for this slice. The action a(k) is composed of N per-cell actions, each taking one of three possible values: increase the resource usage quota σ(k,n) of slice k in cell n by an amount Δ for the next time step, maintain the same resource usage quota, or decrease it by an amount Δ.

In turn, the state s(k) includes N per-cell components, each one given by the triple <ρ(k,n), σ(k,n), σava(n)>, where ρ(k,n) is the fraction of PRBs occupied by slice k in cell n, σ(k,n) is the current resource usage quota allocated to the slice, and σava(n) is the total amount of resource usage quota in the cell not allocated to any slice. While the values of σ(k,n) and σava(n) are directly available at the RAN cross-slice manager, the value of ρ(k,n) is obtained from the different cells through the performance management (PM) services offered by the O1 interface. In particular, using the performance measurements defined in [49], it corresponds to the ratio between the "DL PRB used for data traffic" measurement, which gives the average number of PRBs used for data traffic by a given slice in a cell, and the "DL total available PRB" measurement, which gives the number of PRBs available in the cell. Both measurements are collected from the gNB-DU every time step, so their average is computed over the time step duration t.
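For illustration, the construction of s(k) from these quantities could look like the following sketch; the function and data-structure names are assumptions and not part of the paper, with ρ(k,n) simply derived as the quotient of the two PM counters reported over the last time step.

```python
# Illustrative sketch of the state construction at the RAN cross-slice manager.
# All function/variable names are assumptions; only the two PM measurements and
# the triple <rho(k,n), sigma(k,n), sigma_ava(n)> come from the text above.
def build_slice_state(pm_reports, quotas, slice_id, cells):
    """Return s(k): one <rho, sigma, sigma_available> triple per cell.

    pm_reports[cell]: PM counters from the gNB-DU, averaged over the time step t.
    quotas[cell][slice]: resource usage quotas currently configured in the cell.
    """
    state = []
    for cell in cells:
        used_prb = pm_reports[cell]["dl_prb_used_for_data_traffic"][slice_id]
        total_prb = pm_reports[cell]["dl_total_available_prb"]
        rho = used_prb / total_prb                    # fraction of PRBs occupied by slice k
        sigma = quotas[cell][slice_id]                # quota currently allocated to slice k
        sigma_ava = 1.0 - sum(quotas[cell].values())  # quota not allocated to any slice
        state.append((rho, sigma, sigma_ava))
    return state
```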
Following the DQN approach, the action selection policy π(k) of the k-th slice seeks to maximize a cumulative reward that captures the desired optimization target. In particular, the action selection policy π(k) for a given state s(k) is defined as argmax_a(k) Qk(s(k), a(k), θk), where Qk(s(k), a(k), θk) is the output of a deep NN for the input state s(k) and the output action a(k), providing the maximum expected cumulative reward when starting at s(k) and triggering a(k). The internal structure of the NN is specified by the vector of parameters θk, which contains the weights of the different neuron connections. The optimum values of θk that determine the policies to be followed by the different slices in order to maximize the cumulative reward are learnt offline by the ML training host, which provides them to the ML inference host. Further details about this training process and the reward formulation are given in Section 4.2.
• Resource usage quota computation

This function computes the value of the resource usage quota σ(k,n) to be allocated to each slice and cell for the next time step by applying the increase/maintain/decrease actions provided by the action selection policies of all the slices, and it configures the resulting σ(k,n) values in the O-DU through the rRMPolicyDedicatedRatio attribute. To make the configuration on a per-slice basis, an rRMPolicyMemberList is specified for each RAN slice, composed of a single member with the S-NSSAI and PLMNid of the RAN slice. Then, the rRMPolicyDedicatedRatio is configured per rRMPolicyMemberList in each cell.
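Purely as an illustration, the configuration pushed for one slice in one cell could look like the structure below; only the attribute names come from the text above, while the encoding and example values are assumptions rather than the actual O1 payload.

```python
# Hypothetical rRMPolicyRatio configuration for one RAN slice in one cell.
# Attribute names follow the text above; the S-NSSAI/PLMN values and the use of
# a percentage for the dedicated ratio are illustrative assumptions.
rrm_policy_ratio = {
    "rRMPolicyMemberList": [
        {"sNSSAI": "01-000001", "pLMNId": {"mcc": "001", "mnc": "01"}},  # single member per RAN slice
    ],
    "rRMPolicyDedicatedRatio": 45,  # sigma(k,n) for this slice and cell, e.g. 45%
}
```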
When applying the actions, this function ensures that the maximum cell bit rate associated with the termDensity and dlThptPerUe parameters is not exceeded. Moreover, since the action selection policies of the different slices operate independently, this function also checks that the aggregated resource usage quota of all the slices in a cell after applying the actions does not exceed 1, so that the cell capacity is not exceeded. If it would be, the function first applies the actions of the slices involving a reduction or maintenance of the resource usage quota, and then distributes the remaining capacity among the slices with increase actions. This distribution is proportional to their dlThptPerSlice values, as long as their current throughput is not already higher than dlThptPerSlice. For this adjustment, the measured throughput per slice across all the cells in the last time step is needed; it can be obtained from the PM services of the O1 interface using the "Downstream throughput for Single Network Slice Instance" Key Performance Indicator of [50].
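The capacity check and the proportional redistribution can be sketched per cell as follows; this is a simplified illustration rather than the authors' implementation, the names are assumptions, and the maximum cell bit rate check tied to termDensity/dlThptPerUe is omitted.

```python
# Simplified per-cell sketch of the capacity check described above (not the
# authors' code; names are illustrative and the maximum-cell-bit-rate check
# tied to termDensity/dlThptPerUe is omitted).
def update_cell_quotas(quotas, actions, delta, dl_thpt_per_slice, measured_thpt):
    """quotas, actions: dicts indexed by slice id for one cell; actions in {-1, 0, +1}."""
    # Tentatively apply every slice's action independently.
    tentative = {k: max(0.0, quotas[k] + actions[k] * delta) for k in quotas}
    if sum(tentative.values()) <= 1.0:
        return tentative  # aggregated quota within the cell capacity
    # Capacity would be exceeded: apply decrease/maintain actions first ...
    new_quotas = {k: tentative[k] for k in quotas if actions[k] <= 0}
    # ... and only let slices below their dlThptPerSlice target increase.
    increasing = [k for k in quotas
                  if actions[k] > 0 and measured_thpt[k] < dl_thpt_per_slice[k]]
    for k in quotas:
        if actions[k] > 0 and k not in increasing:
            new_quotas[k] = quotas[k]  # increase withheld: throughput target already met
    # Share the remaining cell capacity proportionally to dlThptPerSlice.
    headroom = 1.0 - sum(new_quotas.values()) - sum(quotas[k] for k in increasing)
    total_target = sum(dl_thpt_per_slice[k] for k in increasing) or 1.0
    for k in increasing:
        extra = max(0.0, headroom) * dl_thpt_per_slice[k] / total_target
        new_quotas[k] = quotas[k] + min(delta, extra)
    return new_quotas
```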
4.2 Trainer of RAN cross-slice management policies

This component constitutes the training part of the DQN model, intended to learn the NN parameters θk that determine the per-slice action selection policies used by the RAN cross-slice manager. The training process makes use of a multi-agent DQN approach in which each DQN agent learns the optimum policy of a different RAN slice by



