Page 66 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks


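Table 1 – Learning Parameters

Parameter                          Value
Learning rate                      0.01
Reward decay rate                  0.9
Minimum exploration rate           0.9
Number of episodes                 22
Number of steps in each episode    40
Number of states                   30
Number of actions                  855

4.2  Reward signals

4.2.1   Reward design with Q-initialization

As discussed in [20], reward signals in our simulation environment are crucial to the RL Markov Decision Process (MDP), since agents are expected to learn the optimal policy under industrial criteria.

Since adding shaping rewards preserves policy invariance [20], the reward function $R(s,a)$ within our problem setting consists of two main parts:

$$R(s,a) = R(s_T,a)_{\text{terminal}} + R(s,a)_{\text{intermediate}} \tag{8}$$

$R(s_T,a)_{\text{terminal}}$ is given to the agent if $s_T$ is approached, and $R(s,a)_{\text{intermediate}}$ works as an intermediate reward in the training process when $s \neq s_T$.

We aim to construct reward shaping for $R(s,a)_{\text{intermediate}}$ using the potential-based method to help guide the agent in the MDP; the potential-based shaping function is defined as follows.

Definition 1 Let any $S$, $A$, $\gamma$ and any shaping reward function $F : S \times A \times S \to \mathbb{R}$ in MDP be given. $F$ is potential-based if there exists a real-valued function $\Phi : S \to \mathbb{R}$ s.t.

$$F(s,a,s') = \gamma\Phi(s') - \Phi(s) \tag{9}$$

for all $s \neq s_0$, $a \in A$, $s' \in S$.

Therefore, based on the results in [20], such an $F$ guarantees consistency with the optimal policy that the agent learned. Luckily, there is no need to construct the shaping function from scratch [21], since the design of $F$ is equivalent to the initialization of $Q(s,a)$.

Suppose the optimal policies learnt in our model with and without potential-based $F$ are $\pi$ and $\pi'$, respectively. Let the initial $Q$ function of $\pi$ be $Q(s,a) = Q_0(s,a)$ with shaping rewards $\gamma\Phi(s') - \Phi(s)$, and the initial $Q$ function of $\pi'$ be $Q'(s,a) = Q_0(s,a) + \Phi(s)$ with no shaping rewards.

By (7), we have the update errors:

$$\begin{cases} \delta = R(s,a) + \gamma\Phi(s') - \Phi(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \\ \delta' = R(s,a) + \gamma \max_{a'} Q'(s',a') - Q'(s,a) \end{cases} \tag{10}$$

We now insert $\Delta Q$ and $\Delta Q'$, the differences between the current and initial values of $Q$ and $Q'$ respectively, into the update error:

$$\begin{cases} \Delta Q(s,a) = Q(s,a) - Q_0(s,a) \\ \Delta Q'(s,a) = Q'(s,a) - Q_0(s,a) - \Phi(s) \end{cases} \tag{11}$$

and we have

$$\begin{aligned} \delta &= R(s,a) + \gamma\Phi(s') - \Phi(s) + \gamma \max_{a'} \big(Q_0(s',a') + \Delta Q(s',a')\big) - Q_0(s,a) - \Delta Q(s,a) \\ &= R(s,a) + \gamma \max_{a'} \big(\Phi(s') + Q_0(s',a') + \Delta Q(s',a')\big) - Q_0(s,a) - \Delta Q(s,a) - \Phi(s) \\ &= R(s,a) + \gamma \max_{a'} Q'(s',a') - Q'(s,a) \\ &= \delta' \end{aligned} \tag{12}$$

The second equality holds because $\Phi(s')$ does not depend on $a'$. The third equality uses (11) together with $\Delta Q = \Delta Q'$, which holds at initialization (both are zero) and is preserved by every update, because equal errors produce equal increments.

Therefore, we investigate the relationship between $R(s,a)_{\text{intermediate}}$ and $Q^*(s,a)$ to decide the form of $R(s,a)_{\text{intermediate}}$. In the MDP problem setting [18], the discounted return from time step $t$ is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, and since $\gamma \in (0,1)$, if $R(s,a)_{\text{intermediate}}$ is formed as a bounded series based on the distance from $s$ to $s_T$: $R(s,a)_{\text{intermediate}} \leq R^{\max}_{\text{intermediate}}$, where $R^{\max}_{\text{intermediate}} \leq 1$, we have

$$\begin{aligned} G_t &= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\ &\leq \sum_{k=0}^{\infty} \gamma^k R^{\max}_{\text{intermediate}} \\ &\leq R^{\max}_{\text{intermediate}} \sum_{k=0}^{\infty} \gamma^k \\ &= \frac{R^{\max}_{\text{intermediate}}}{1-\gamma} \end{aligned} \tag{13}$$

then for the optimal policy $\pi^*$ [18]

$$Q^*(s,a) = \mathbb{E}[G_t] \leq R(s_T,a)_{\text{terminal}} \tag{14}$$

we know $R(s_T,a)_{\text{terminal}}$ and $R^{\max}_{\text{intermediate}}$ satisfy:

$$\frac{R^{\max}_{\text{intermediate}}}{1-\gamma} \leq R(s_T,a)_{\text{terminal}} \tag{15}$$

(15) gives an explicit gap between the two parts of $R(s,a)$ and also directly influences the following initialization of $Q(s,a)$.

Q-initialization setting

We rewrite the initial $Q$ table of policy $\pi'$ and the final converged table as $Q'_0$ and $Q'_{\text{converged}}$ respectively. By (7),

$$\begin{aligned} Q'(s,a) &\leftarrow Q'_0 + \alpha\big(R(s,a)_{\text{intermediate}} + \gamma Q'_0 - Q'_0\big) \\ &= Q'_0 + \alpha(1-\gamma)\left(\frac{R(s,a)_{\text{intermediate}}}{1-\gamma} - Q'_0\right) \end{aligned} \tag{16}$$

we can derive that

$$Q'_0 > Q'_{\text{converged}} = \frac{R(s,a)_{\text{intermediate}}}{1-\gamma} \tag{17}$$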
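The equivalence between potential-based shaping and Q-initialization derived in (10) to (12) can be checked numerically. The following is a minimal sketch, not the simulation environment of this paper: it assumes a tabular Q-learning update with a max over next actions, a small random MDP with no terminal state, and randomly drawn potentials and rewards. The names `q_update` and `compare`, and the state and action counts, are hypothetical; only the learning rate 0.01 and the decay rate 0.9 are taken from Table 1.

```python
import random


def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the update error."""
    td_error = r + gamma * max(q[s_next]) - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


def compare(n_states=5, n_actions=3, steps=2000, alpha=0.01, gamma=0.9, seed=0):
    """Train two learners on identical experience and measure how far they diverge.

    Learner 1 starts from Q_0 and receives the shaping reward
    gamma * phi(s') - phi(s). Learner 2 starts from Q_0 + phi(s) and
    receives no shaping reward.
    """
    rng = random.Random(seed)
    # Hypothetical potential function and initial table, drawn at random.
    phi = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]
    q0 = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
          for _ in range(n_states)]

    q_shaped = [row[:] for row in q0]
    q_init = [[q0[s][a] + phi[s] for a in range(n_actions)]
              for s in range(n_states)]

    max_error_gap = 0.0
    for _ in range(steps):
        # Random transitions stand in for the environment (an assumption).
        s = rng.randrange(n_states)
        a = rng.randrange(n_actions)
        s_next = rng.randrange(n_states)
        r = rng.uniform(0.0, 1.0)

        shaping = gamma * phi[s_next] - phi[s]
        d_shaped = q_update(q_shaped, s, a, r + shaping, s_next, alpha, gamma)
        d_init = q_update(q_init, s, a, r, s_next, alpha, gamma)
        max_error_gap = max(max_error_gap, abs(d_shaped - d_init))

    # If the update errors always agree, the tables differ by exactly phi(s).
    max_table_gap = max(
        abs(q_init[s][a] - q_shaped[s][a] - phi[s])
        for s in range(n_states)
        for a in range(n_actions)
    )
    return max_error_gap, max_table_gap


if __name__ == "__main__":
    error_gap, table_gap = compare()
    print("largest difference between update errors:", error_gap)
    print("largest deviation of Q' - Q from phi:", table_gap)
```

Both gaps should stay at floating-point rounding level, which is the numerical counterpart of the two update errors being equal: the two learners carry the same increments, so their tables differ only by the potential of each state.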
          50                                 © International Telecommunication Union, 2021