where $\alpha$ is the path-loss exponent, $P_0$ is the transmit power of the serving eNodeB, $N$ is the number of neighboring eNodeBs, $P_i$ is the transmit power from neighboring eNodeB $i$, $d_0$ is the distance of the UE to the serving station, $d_i$ is the distance of the UE to each of the neighboring stations, and $N_0 B$ is the background noise, with $N_0$ the thermal noise and $B$ the system bandwidth.
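The SINR expression that these quantities parameterize is not reproduced on this page; for orientation, a standard downlink form consistent with the definitions above (an assumption, not a verbatim copy of the paper's formula) is:

$$\gamma_0 \;=\; \frac{P_0\, d_0^{-\alpha}}{\displaystyle\sum_{i=1}^{N} P_i\, d_i^{-\alpha} \;+\; N_0 B}$$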
According to [16], the UE density $\lambda_0(t)$ in (4) is assumed to follow an independently and identically distributed two-dimensional Poisson point process. The number of users $n_0(t)$ of the target cell with area $C_0$ is given by

$$\mathbb{P}\{n_0(t) = n \mid \lambda_0(t)\} \;=\; \frac{\big(\lambda_0(t)\, C_0\big)^{n}}{n!}\, e^{-\lambda_0(t)\, C_0} \qquad (6)$$

From (4) to (6), the optimal parameters $\langle \varphi^*, \theta^* \rangle$ can be found, and the best weight $\Pi$ can be derived by substituting (4)-(6) into (1)-(3).
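As a quick numerical illustration of (6), the sketch below evaluates the Poisson probability for a hypothetical cell; the density and area values are invented for the example and carry no significance from the paper.

```python
import math

def prob_n_users(n: int, density: float, area: float) -> float:
    """Poisson pmf from (6): probability that the target cell of area C0
    contains exactly n users, given the UE density lambda0(t)."""
    mean = density * area  # expected number of users in the cell
    return (mean ** n) / math.factorial(n) * math.exp(-mean)

# Hypothetical numbers: 0.02 users/m^2 over a 1000 m^2 small cell (mean = 20).
for n in (10, 20, 30):
    print(f"P(n0 = {n:2d}) = {prob_n_users(n, density=0.02, area=1000.0):.4f}")
```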
4.  THE PROPOSED REINFORCEMENT LEARNING ASSISTED BEAMFORMING
Due to the lack of the prior knowledge required to find the theoretical optimal solution of (4), some research has been conducted on related surrogate optimization problems. Generally, classical RL methods such as Sarsa [17] and Q-learning [13] have been attempted. Those methods lack convergence efficiency in practice even though their convergence can be guaranteed [18].
In this work, a dynamic Q-learning beamforming method is proposed to mitigate ICI and enhance convergence efficiency for user coverage in a dense Urban-eMBB transmission environment. It estimates the Probability Density Function (PDF) of users' occurrences to achieve an optimal beamforming solution by trial and error, without knowledge of the network and transmission channel.
4.1  RL-based beamforming
In the RL-based beamforming process shown in Fig. 1(a) and (b), the BS in the target cell estimates the probability density $\lambda_0^{(t)}$ of users' occurrences in the target small cell #0 by a long-term statistical analysis of the data in (6) at time slot $t$. Once all served users send their SINRs $\gamma^{(t-1)}$ at the time slot $(t-1)$ to the BS, the state $[\lambda_0^{(t)}, \gamma^{(t-1)}]$ observed by the BS at the time slot $t$ is obtained, and then an RL-based beamforming algorithm is applied to search for the optimal parameters for ICI mitigation and coverage optimization.
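A minimal sketch of this per-slot observation step follows; `estimate_density` and `collect_sinr_reports` are hypothetical stand-ins for the BS's statistics pipeline and the UE reporting channel, not functions from the paper.

```python
def observe_state(t, estimate_density, collect_sinr_reports):
    """Assemble the state the BS observes at time slot t: the long-term
    density estimate lambda0(t) from (6) and the SINRs reported by the
    served users at slot t - 1."""
    lambda0_t = estimate_density(t)           # long-term statistical estimate
    gamma_prev = collect_sinr_reports(t - 1)  # SINR reports from served UEs
    return (lambda0_t, gamma_prev)
```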
We formulate the beamforming optimization problem under the MMIMO system context as an RL problem and therefore provide a dynamic Q-learning scheme to address the issue.

First, we define the agent to be the MMIMO system, the set of states $\mathcal{S} \triangleq \{s_j\}_{j=0}^{J-1}$ to be the levels of average regional SINR, and the set of actions $\mathcal{A} \triangleq \{a_k\}_{k=0}^{K-1}$ to be the possible combinations of antenna parameters. More precisely, each $s_j$ is an interval of the SINR value; $s_0$ is the optimal SINR value interval, i.e. the highest achievable SINR value derived from expert experience in the current environment. Similarly, $s_{J-1}$ is the lowest SINR range, and as $j$ increases the boundary values of $s_j$ decrease, so a higher $j$ implies a poorer signal performance state $s_j$. Each action $a_k$ is a choice of antenna parameters made by the MMIMO system, and consists of azimuth, vertical angle and beam width. The environment is a signal simulator; see Section 5.1 for more detail. The objective is to approach the optimal target SINR state $s_0$ to achieve the best signal performance. It covers the probability in (4) of the average regional SINR given by the simulator and guided by the selected action $a$. The environment (Fig. 1(c)) grants the agent a reward $r_{s,a}$ after the latter takes an action $a \in \mathcal{A}$ when it is in state $s \in \mathcal{S}$.
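How such discrete state and action spaces might be enumerated is sketched below; the SINR interval boundaries and the antenna parameter grids are illustrative assumptions (the paper derives the intervals from expert experience and Table 1).

```python
import itertools

# States: J intervals of average regional SINR (dB), best first (s_0).
# Boundary values are illustrative placeholders only.
sinr_bounds = [30.0, 25.0, 20.0, 15.0, 10.0, 5.0]   # J = 6
J = len(sinr_bounds)

def sinr_to_state(sinr_db: float) -> int:
    """Map an average regional SINR to a state index j; the boundary
    values of s_j decrease as j grows, so higher j = poorer signal."""
    for j, bound in enumerate(sinr_bounds):
        if sinr_db >= bound:
            return j
    return J - 1  # everything below the last boundary is the worst state

# Actions: all combinations of azimuth, vertical angle and beam width.
azimuths   = [-30, 0, 30]   # degrees, illustrative grid
tilts      = [2, 6, 10]     # degrees
beamwidths = [30, 65]       # degrees
actions = list(itertools.product(azimuths, tilts, beamwidths))
K = len(actions)            # K = 18 in this sketch
```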
Formally, we denote the state-action value function, i.e. the expected discounted reward, as $Q(s,a)$. In the table $\mathbf{Q} \in \mathbb{R}^{J \times K}$, we use the notation [19] $Q(s,a) \triangleq [\mathbf{Q}]_{s,a}$ and update entries by:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r_{s,a} + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big] \qquad (7)$$

where $\alpha : 0 < \alpha < 1$ is the learning rate and $\gamma : 0 < \gamma < 1$ is the discount factor, which determines the importance of future rewards. $s'$ and $a'$ are the next state and action, respectively.
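A minimal sketch of the tabular update (7), continuing the state/action sketch above; the reward signal and the next-state transition would come from the signal simulator.

```python
import numpy as np

Q = np.zeros((J, K))        # Q-table with [Q]_{s,a} = Q[s, a]
alpha, gamma = 0.1, 0.9     # learning rate and discount factor (illustrative)

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """One application of update rule (7)."""
    td_target = r + gamma * Q[s_next].max()   # r_{s,a} + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (td_target - Q[s, a])
```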
                                                                                      
An episode is a period of time in which an interaction between the environment and the agent takes place. Here, an episode consists of (at most) $T$ transitional discrete time steps $t$. During an episode $t : t \in \{0, 1, \ldots, T\}$, the agent makes decisions to maximize the effects of the actions it selects. To achieve this goal, we apply the $\epsilon$-greedy learning strategy to balance exploration and exploitation, where $1 - \epsilon : 0 < \epsilon < 1$ is the exploration rate and serves as the threshold probability for selecting a random $a \in \mathcal{A}$, as opposed to selecting an action by exploitation. To gradually reduce this randomness, $\epsilon$ increases in every episode from an initial value until it reaches a preset upper bound.
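This selection rule, with the paper's convention that $1-\epsilon$ is the exploration probability, might look as follows; the schedule constants are illustrative, and `Q` and `K` continue the earlier sketches.

```python
import random

epsilon, eps_max, eps_step = 0.5, 0.95, 0.01   # illustrative schedule

def select_action(s: int) -> int:
    """epsilon-greedy: explore with probability 1 - epsilon,
    otherwise exploit the current Q-table."""
    if random.random() > epsilon:
        return random.randrange(K)             # random a in A (exploration)
    return int(Q[s].argmax())                  # greedy action (exploitation)

def end_of_episode() -> None:
    """Raise epsilon after each episode, up to its preset upper bound."""
    global epsilon
    epsilon = min(eps_max, epsilon + eps_step)
```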
The $\mathcal{S}$ space is constructed by partitioning the range of the process Cumulative Distribution Function (CDF), which is the probability of users with SINR below the given threshold in (4). The components of the $\mathcal{S}$ space are shown in Table 1. Through a finite series of $a' \in \mathcal{A}' := \mathcal{A} - C$ (to be discussed later), the agent attempts to approach $s_0$ in response to the simulated SINR at step $t$ within an episode.
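The partitioning itself can be pictured as binning the CDF range; a uniform split is assumed below purely for illustration, since the real intervals come from Table 1, which is not reproduced here.

```python
def cdf_to_state(cdf_value: float) -> int:
    """Map a CDF value (fraction of users below the SINR threshold)
    to a state index; a low CDF value means few poorly served users,
    i.e. a state near s_0."""
    return min(int(cdf_value * J), J - 1)
```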