Table 1 – Learning Parameters

    Parameter                          Value
    Learning rate                      0.01
    Reward decay rate                  0.9
    Minimum exploration rate           0.9
    Number of episodes                 22
    Number of steps in each episode    40
    Number of states                   30
    Number of actions                  855
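To make the role of these parameters concrete, the sketch below wires them into a standard tabular ε-greedy Q-learning loop consistent with the Q-update referenced as Eq. (7); the environment interface (env.reset / env.step), the exploration-decay schedule, and all variable names are illustrative assumptions, not the paper's simulator code.

```python
import numpy as np

# Hyperparameters taken from Table 1
ALPHA = 0.01          # learning rate
GAMMA = 0.9           # reward decay rate (discount factor)
EPSILON_MIN = 0.9     # minimum exploration rate
N_EPISODES = 22
N_STEPS = 40          # number of steps in each episode
N_STATES = 30
N_ACTIONS = 855

def train(env, rng=None):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); this interface is
    illustrative rather than the paper's actual simulator."""
    rng = rng or np.random.default_rng(0)
    q = np.zeros((N_STATES, N_ACTIONS))
    epsilon = 1.0
    for _ in range(N_EPISODES):
        s = env.reset()
        for _ in range(N_STEPS):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(N_ACTIONS))
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # One-step Q update (cf. Eq. (7)).
            q[s, a] += ALPHA * (r + GAMMA * np.max(q[s_next]) - q[s, a])
            s = s_next
            if done:
                break
        # Decay exploration, but never below the minimum of Table 1.
        epsilon = max(EPSILON_MIN, epsilon * 0.99)
    return q
```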
4.2  Reward signals

4.2.1  Reward design with Q-initialization

As discussed in [20], reward signals in our simulation environment are crucial to the RL Markov Decision Process (MDP), since the agents are expected to learn the optimal policy under industrial criteria.

Since any additional reward must preserve policy invariance [20], the reward function R(s,a) within our problem setting consists of two main parts:

    R(s,a) = R(s_0,a)_{\mathrm{goal}} + R(s,a)_{\mathrm{inter}}        (8)

R(s_0,a)_goal is given to the agent if the goal state s_0 is approached, and R(s,a)_inter works as an intermediate reward in the training process when s ≠ s_0.
We aim to construct reward shaping for R(s,a)_inter using the potential-based method to help guide the agent in the MDP; the potential-based shaping function is defined as [20]:

Definition 1  Let any S, A, γ and any shaping reward function F : S × A × S → ℝ in the MDP be given. F is potential-based if there exists a real-valued function Φ : S → ℝ s.t.

    F(s,a,s') = \gamma \Phi(s') - \Phi(s)        (9)

for all s ≠ s_0, s ∈ S, a ∈ A.
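A minimal sketch of how Eq. (8) and Definition 1 fit together is given below: the goal reward is paid out at s_0, and the intermediate part is built from the potential-based term of Eq. (9). The particular potential Φ (negative distance to the goal) and the constant R_GOAL are illustrative placeholders, not the paper's actual choices.

```python
GAMMA = 0.9            # reward decay rate from Table 1
R_GOAL = 10.0          # placeholder value for R(s0, a)_goal

def phi(s, s0, distance):
    """Illustrative potential: the closer the state s is to the goal s0,
    the larger Φ(s)."""
    return -distance(s, s0)

def reward(s, a, s_next, s0, distance):
    """Composite reward of Eq. (8): goal reward at s0, potential-based
    shaping term F(s, a, s') = γΦ(s') − Φ(s) of Eq. (9) elsewhere."""
    if s_next == s0:
        return R_GOAL
    return GAMMA * phi(s_next, s0, distance) - phi(s, s0, distance)
```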
Therefore, based on the results in [20], such an F can guarantee consistency with the optimal policy that the agent learned. Luckily, there is no need to construct the shaping function from scratch [21], since the design of F is equivalent to the initialization of [Q]_{s,a}.

Suppose the optimal policies learnt in our model with and without potential-based F are π and π', respectively. Let the initial Q function of π be Q(s,a) = Q_0(s,a) with shaping rewards γΦ(s') − Φ(s), and the initial Q function of π' be Q'(s,a) = Q_0(s,a) + Φ(s) with no shaping rewards.

By (7), we have the update errors:

    \begin{cases}
    \delta  = R(s,a) + \gamma \Phi(s') - \Phi(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \\
    \delta' = R(s,a) + \gamma \max_{a'} Q'(s',a') - Q'(s,a)
    \end{cases}        (10)
Now insert ΔQ and ΔQ', the differences between the current and initial values of Q and Q' respectively, into the update errors:

    \begin{cases}
    \Delta Q(s,a)  = Q(s,a) - Q_0(s,a) \\
    \Delta Q'(s,a) = Q'(s,a) - Q_0(s,a) - \Phi(s)
    \end{cases}        (11)

We then have

    \begin{aligned}
    \delta &= R(s,a) + \gamma \Phi(s') - \Phi(s) + \gamma \max_{a'} \big( Q_0(s',a') + \Delta Q(s',a') \big) - Q_0(s,a) - \Delta Q(s,a) \\
           &= R(s,a) + \gamma \max_{a'} \big( \Phi(s') + Q_0(s',a') + \Delta Q(s',a') \big) - Q_0(s,a) - \Delta Q(s,a) - \Phi(s) \\
           &= R(s,a) + \gamma \max_{a'} Q'(s',a') - Q'(s,a) \\
           &= \delta'
    \end{aligned}        (12)

where the third equality uses Q'(s,a) = Q_0(s,a) + Φ(s) + ΔQ'(s,a) together with ΔQ(s,a) = ΔQ'(s,a); the latter holds by induction, since both tables start from ΔQ = ΔQ' = 0 and, because δ = δ', they receive identical updates at every step.
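Equation (12) can also be checked numerically. The snippet below feeds the same random transitions to a learner that receives the shaping reward of Eq. (9) starting from Q_0 and to a learner that receives no shaping but starts from Q_0 + Φ, and asserts that the two update errors coincide at every step; the random potential, tables, and transitions are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
GAMMA, ALPHA = 0.9, 0.01
N_S, N_A = 5, 3

phi = rng.normal(size=N_S)            # arbitrary potential Φ(s)
q0 = rng.normal(size=(N_S, N_A))      # shared initial table Q_0
q_shaped = q0.copy()                  # learner trained with shaping rewards
q_init = q0 + phi[:, None]            # learner initialised with Q_0 + Φ

for _ in range(1000):
    s, a = rng.integers(N_S), rng.integers(N_A)
    s_next, r = rng.integers(N_S), rng.normal()
    # Shaped learner: reward augmented by F(s, a, s') = γΦ(s') − Φ(s).
    d = (r + GAMMA * phi[s_next] - phi[s]
         + GAMMA * q_shaped[s_next].max() - q_shaped[s, a])
    # Unshaped learner with Φ folded into the initial Q table.
    d_prime = r + GAMMA * q_init[s_next].max() - q_init[s, a]
    assert np.isclose(d, d_prime)     # Eq. (12): δ = δ′
    q_shaped[s, a] += ALPHA * d
    q_init[s, a] += ALPHA * d_prime
```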
Therefore, we investigate the relationship between R(s,a)_inter and R(s_0,a)_goal to decide the form of R(s,a)_inter. In the MDP problem setting [18], the discounted return from time step t is G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}, and since γ ∈ (0, 1), if R(s,a)_inter is formed as a bounded series based on the distance from s to s_0, i.e. R(s,a)_inter ≤ r_max with r_max ≤ 1, we have

    \begin{aligned}
    G_t &= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\
        &\le \sum_{k=0}^{\infty} \gamma^k r_{\max} \\
        &\le r_{\max} \sum_{k=0}^{\infty} \gamma^k \\
        &= \frac{r_{\max}}{1-\gamma}
    \end{aligned}        (13)

Then, for the optimal policy π* [18],

    Q^{\pi^*}(s,a) = \mathbb{E}_{\pi^*}[G_t] \le \frac{r_{\max}}{1-\gamma}        (14)

so we know that R(s_0,a)_goal and r_max must satisfy

    \frac{r_{\max}}{1-\gamma} \le R(s_0,a)_{\mathrm{goal}}        (15)

(15) gives an explicit gap between the two parts of R(s,a) and also directly influences the following initialization of Q(s,a).
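As a concrete illustration (the exact bound r_max is not stated on this page, so the numbers are assumptions): with the reward decay rate γ = 0.9 from Table 1 and r_max = 1,

    \frac{r_{\max}}{1-\gamma} = \frac{1}{1-0.9} = 10 \le R(s_0,a)_{\mathrm{goal}},

i.e. the goal reward must be at least ten, an order of magnitude above any single intermediate reward, so that reaching s_0 dominates any return obtainable from the intermediate rewards alone.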
4.2.2  Q-initialization setting

We rewrite the initial Q table of policy π' and its final converged table as Q'_0 and Q'_conv, respectively. By (7),

    \begin{aligned}
    Q'(s,a) &\leftarrow Q'_0 + \alpha \big( R(s_0,a)_{\mathrm{goal}} + \gamma Q'_0 - Q'_0 \big) \\
            &= Q'_0 + \alpha (1-\gamma) \big( Q'_{\mathrm{conv}} - Q'_0 \big)
    \end{aligned}        (16)

we can derive that

    Q'_0 > Q'_{\mathrm{conv}} = \frac{R(s_0,a)_{\mathrm{goal}}}{1-\gamma}        (17)