to guarantee the convergence of the model update. Moreover, under the Q‑learning scheme, (17) always provides chances of exploration for actions that have not yet been attempted.
To this end, we give the reward signals for $(s,a)$ as follows:
$$
r_{s,a} :=
\begin{cases}
-\dfrac{0.1\,(d-2)}{2.8^{t}+1}, & d \ge 2,\\[6pt]
-\dfrac{0.01}{2.8^{t}+1}, & d = 1,\\[6pt]
\dfrac{0.2\,(30-t)}{2.8^{t}+1}, & d = 0,
\end{cases}
\qquad (18)
$$
Here the first case applies when $d \ge 2$, and we set $r_{s,a}\big|_{d=2} = -\frac{0.01}{2.8^{t}+1}$ in (18) to follow the conditions we derived in (13) and (15). Therefore, we can initialize the $Q$ function as $[Q]_{s,a} := 0_{|S|\times|A|}$ to satisfy (17).
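To make the shape of (18) concrete, the following is a minimal Python sketch of the reward signal, assuming $d$ is an integer distance between the current CDF state and the target, $t$ is the step index, and 30 is the per-episode step budget; the function name and the exact form of the decay term are illustrative assumptions, not taken verbatim from the paper.

```python
# Hedged sketch of the reward signal in (18).
# Assumptions: d is an integer distance from the current CDF state to the
# target state s_0, t is the step index, and 30 is the per-episode step budget.

def reward(d: int, t: int) -> float:
    """Piecewise reward r_{s,a} sketched from (18)."""
    decay = 2.8 ** t + 1.0           # time-decaying denominator (assumed form)
    if d >= 2:                       # far from the target: penalty grows with d
        return -0.1 * (d - 2) / decay
    elif d == 1:                     # one step away: small fixed penalty
        return -0.01 / decay
    else:                            # d == 0: target reached, earlier arrivals rewarded more
        return 0.2 * (30 - t) / decay
```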
Algorithm 1 Optimal Action Selection Control

Input: Initial CDF state $s_{\text{init}}$ and target state $s_0$.
Output: Optimal $a$ to approach $s_0$ during episode $e$.
 1: Define customized $\alpha$, $\gamma$, $\epsilon$, and $T$.
 2: Initialize $C := \{\}$, $Q := 0_{|S|\times|A|}$, $t := 0$
 3: Initialize $s := s_{\text{init}}$, $e := 0$
 4: repeat
 5:   while $t < T$ do
 6:     $\epsilon := \max\big(\epsilon,\ \epsilon_0 + t\cdot\delta/(T\cdot E)\big)$
 7:     Sample $n_1, n_2 \sim U(0, 1)$
 8:     if $n_1 \le \epsilon$ then
 9:       if $n_2 > (1 - \epsilon)$ then
10:         Select $a \in A - C$, $a = \arg\max_{a'} Q(s, a')$
11:       else
12:         Select $a \in A$, $a = \arg\max_{a'} Q(s, a')$
          end
13:     else
14:       Select $a \in A - C$ randomly
        end
15:     Perform $a$ in the simulator and obtain $s'$, $r(s, a)$
16:     Update the entry $Q(s, a)$ as in (7)
17:     $s \leftarrow s'$, $t \leftarrow t + 1$
18:     if $s \ne s_0$ then
19:       Append $a$ to C
20:     else
21:       Early stopping
22:       return
23:     end
24:   end while
25: until $s = s_0$, otherwise proceed to episode $e + 1$
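Steps 7–14 of Algorithm 1 can be sketched in Python as follows; this is a hedged illustration in which `Q` is a $|S|\times|A|$ table, `controller` is the set C of ruled-out actions, and the helper name `select_action` is ours rather than the paper's.

```python
import numpy as np

def select_action(Q, s, controller, eps, rng):
    """Double epsilon-greedy selection, sketched from Algorithm 1 (steps 7-14).

    Q: |S| x |A| table, s: current state index,
    controller: set of action indices ruled out by controller C,
    eps: current epsilon value, rng: numpy random Generator.
    """
    n_actions = Q.shape[1]
    allowed = [a for a in range(n_actions) if a not in controller]  # A - C
    n1, n2 = rng.uniform(0.0, 1.0, size=2)              # step 7
    if n1 <= eps:                                       # step 8: exploit
        if n2 > (1.0 - eps) and allowed:                # step 9: restrict to A - C
            return max(allowed, key=lambda a: Q[s, a])  # step 10
        return int(np.argmax(Q[s]))                     # step 12: greedy over full A
    # step 14: explore uniformly over A - C
    pool = allowed if allowed else list(range(n_actions))
    return int(rng.choice(pool))
```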
4.3 Dynamic Q‑Learning algorithm

Considering the computational and equipment cost in an MMIMO system, the delaying effect of the reward should be minimized. Hence, after each step $t$ we apply the $\epsilon$‑greedy strategy twice, using the controller to help avoid actions that are unrelated to $s_0$ and to dynamically shrink the space $A$, in order to make up for the delay in (18). Therefore, the controller plays a highly efficient role as the penalty signal in our reward and serves as a reinforcement mechanism to assist the selection. The upper bound of the time complexity of the dynamic Q‑learning method is given in [22].

For a total of at most $T$ trials in $E$ episodes with a fixed initial environment setting, Algorithm 1 stops training the agent once $s_0$ is approached, rather than continuing the process, owing to the design of the reward signals in our model:

Controller C: As shown in Algorithm 1, controller C shrinks the action space related to $s_0$ in every step $t$ based on the double $\epsilon$‑greedy principle. This operation enables the optimal action to be selected with higher and higher probability as $t$ goes on.

Reward $r_{s,a}$: (18) guarantees that the agent learns a global optimum, our target action, instead of continuously jumping among local optima for meaningless rewards [23]. The reward signals and controller C attempt to guide the agent by avoiding redundant scoring and long‑term penalties. The agent itself continuously updates the learning policy under the guidance of both of them.
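The interplay between the controller and the early-stopping rule (steps 15–25 of Algorithm 1) might look like the sketch below; the environment interface `env.step`, the increasing $\epsilon$ schedule, and the helper `select_action` above are assumptions made for illustration, not the paper's exact interface.

```python
# Hedged sketch of the outer loop of Algorithm 1: the controller C grows
# whenever an action fails to reach the target, and training stops early
# once the target state s_0 is reached.

def train_episode(env, Q, s_init, s_target, eps0, alpha, gamma, T, rng):
    controller = set()                    # controller C: actions ruled out so far
    s, t = s_init, 0
    while t < T:
        eps = min(1.0, eps0 + t / T)      # assumed increasing epsilon schedule
        a = select_action(Q, s, controller, eps, rng)
        s_next, r = env.step(a)           # step 15: act in the simulator
        # step 16: Q-learning update as in (7)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s, t = s_next, t + 1              # step 17
        if s != s_target:
            controller.add(a)             # step 19: shrink the usable action space
        else:
            break                         # steps 21-22: early stopping
    return Q, controller
```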
4.4 Other existing methods

For the not-too-large $S \times A$ space defined in Section 4.1, the MC exhaustion algorithm often serves as a baseline solution for the problem in Section 3. It requires testing all possible $a \in A$ to determine the best action within the space $A$.
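A brute-force sweep of this kind could be sketched as follows; the `evaluate` and `distance` callables stand in for one Monte Carlo trial and for the state metric, and are placeholders of ours rather than the paper's interface.

```python
# Hedged sketch of the MC exhaustion baseline: try every action in A and
# keep the one whose simulated outcome is closest to the target state.

def mc_exhaustion(actions, evaluate, s_target, distance):
    """actions: iterable of candidate actions (the space A),
    evaluate: runs one Monte Carlo trial and returns the resulting state,
    distance: metric between a state and the target."""
    best_action, best_gap = None, float("inf")
    for a in actions:                 # test all a in A
        s = evaluate(a)               # one MC trial per candidate
        gap = distance(s, s_target)
        if gap < best_gap:
            best_action, best_gap = a, gap
    return best_action
```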
Therefore, we apply the classical model‑free RL methods Q‑learning (off‑policy) and Sarsa (on‑policy) [18] in this problem setting. They differ mainly in the style of the Q‑function update: while Q‑learning follows (7), Sarsa uses the update below:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r_{s,a} + \gamma\,Q(s',a') - Q(s,a)\,\big] \qquad (19)
$$
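The practical difference between the two updates is which next-state value enters the temporal-difference target; a minimal sketch, with `Q` as a numpy table and `alpha`, `gamma` the learning rate and discount (their values, as in Table 1, are not reproduced here):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy target (as in (7)): bootstrap on the greedy action at s'.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target (as in (19)): bootstrap on the action a' actually taken at s'.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```

Only `sarsa_update` needs the actually selected next action `a_next`, which is what makes it on-policy.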
Parameters for these models are set the same as in Table 1. Unsurprisingly, the off‑policy method turns out to be superior to the on‑policy method [18] in the experiments discussed later.

With the experience gained from Algorithm 1, Algorithm 2 is proposed to test the trained agent's policy with any randomized initial state $s_{\text{init}}$.

5. SIMULATIONS AND DISCUSSIONS

To thoroughly investigate the performance of the proposed RL‑assisted full dynamic beamforming method and to validate the effectiveness of the preceding theoretical analysis, we present statistical results on the SINRs and on the computational complexity of the proposed algorithm compared with other industrial methods. We implement Algorithm 1 within the environment below, with the preset parameters shown in both Table 1 and Table 2.