Table 1 – Learning Parameters

Parameter                         Value
Learning rate                     0.01
Reward decay rate                 0.9
Minimum exploration rate          0.9
Number of episodes                22
Number of steps in each episode   40
Number of states                  30
Number of actions                 855
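For concreteness, the Table 1 settings can be gathered into a single configuration for a tabular agent. This is a minimal sketch, assuming an ε-greedy Q-learning setup; the key and variable names are illustrative, not the paper's.

```python
import numpy as np

# A minimal sketch collecting the Table 1 settings in one place, assuming a
# tabular epsilon-greedy Q-learning agent; names are illustrative.
CONFIG = {
    "learning_rate": 0.01,        # alpha in the update rule (7)
    "reward_decay": 0.9,          # discount factor gamma
    "min_exploration_rate": 0.9,  # lower bound on the exploration rate
    "n_episodes": 22,
    "n_steps_per_episode": 40,
    "n_states": 30,
    "n_actions": 855,
}

# Q-table sized by the state and action counts from Table 1; Section 4.2.2
# discusses a non-zero initialization.
q_table = np.zeros((CONFIG["n_states"], CONFIG["n_actions"]))
```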
4.2 Reward signals

4.2.1 Reward design with Q-initialization
As discussed in [20], reward signals in our simulation environment are crucial to the RL Markov Decision Process (MDP), since agents are expected to learn the optimal policy under industrial criteria.

Since adding additional rewards must follow policy invariance [20], the reward function r(s, a) within our problem setting consists of two main parts:

    r(s, a) = r_0(s, a) + r_m(s, a)    (8)

r_0(s, a) is given to the agent if the goal state s_0 is approached, and r_m(s, a) works as an intermediate reward in the training process when s ≠ s_0.
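The decomposition in (8) is straightforward to express in code. The sketch below assumes a goal state s0 and a distance function d; the constants r0 and lam and all helper names are illustrative placeholders, not the paper's values.

```python
# Hypothetical goal state s0 and distance function d; r0 and lam are
# illustrative values, not the paper's.
def reward(s, a, s0, d, r0=10.0, lam=0.9):
    """Two-part reward r(s, a) = r_0(s, a) + r_m(s, a) from (8)."""
    if s == s0:
        # Goal reward r_0(s, a), paid when s_0 is approached.
        return r0
    # Intermediate reward r_m(s, a): a bounded series in the distance to s_0,
    # r_m <= lam ** d(s, s0) with lam <= 1 (cf. the bound used in (13)).
    return lam ** d(s, s0)
```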
We aim to construct reward shaping for r_m(s, a) using the potential-based method to help guide the agent in the MDP; the potential-based shaping function is defined as [20]:

Definition 1. Let any S, A, γ and any shaping reward function F : S × A × S → ℝ in an MDP be given. F is potential-based if there exists a real-valued function Φ : S → ℝ s.t.

    F(s, a, s′) = γΦ(s′) − Φ(s)    (9)

for all s ≠ s_0, s′ ∈ S, a ∈ A.
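Definition 1 translates directly into code. The sketch below builds F from an assumed potential function phi; both phi and the example at the end are placeholders, not the paper's construction.

```python
# A minimal sketch of Definition 1: a shaping reward F(s, a, s') built from
# an assumed real-valued potential function phi.
def make_shaping(phi, gamma=0.9):
    """Return F(s, a, s') = gamma * Phi(s') - Phi(s) as in (9)."""
    def F(s, a, s_next):
        return gamma * phi(s_next) - phi(s)
    return F

# Example (hypothetical): a potential that grows as the agent nears s0.
# F = make_shaping(lambda s: -distance(s, s0))
```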
Therefore, based on the results in [20], such an F can guarantee consistency with the optimal policy that the agent learned. Luckily, there is no need to construct the shaping function from scratch [21], since the design of F is equivalent to the initialization of Q(s, a).
Suppose the optimal policies learnt in our model with and without the potential-based F are π′ and π, respectively. Let the initial function of Q′ be Q′_0(s, a) with shaping rewards γΦ(s′) − Φ(s), and the initial function of Q be Q_0(s, a) = Q′_0(s, a) + Φ(s) with no shaping rewards. By (7), we have the update errors:

    δ′ = r_{s,a} + γΦ(s′) − Φ(s) + γQ′(s′, a′) − Q′(s, a)
    δ  = r_{s,a} + γQ(s′, a′) − Q(s, a)    (10)

and now insert Δ and Δ′, the differences between the current and initial values of Q and Q′ respectively, into the update error:

    Δ′(s, a) = Q′(s, a) − Q′_0(s, a)
    Δ(s, a)  = Q(s, a) − Q′_0(s, a) − Φ(s)    (11)

we have

    δ′ = r_{s,a} + γΦ(s′) − Φ(s) + γ(Q′_0(s′, a′) + Δ′(s′, a′)) − Q′_0(s, a) − Δ′(s, a)
       = r_{s,a} + γ(Φ(s′) + Q′_0(s′, a′) + Δ′(s′, a′)) − Q′_0(s, a) − Δ′(s, a) − Φ(s)
       = r_{s,a} + γQ(s′, a′) − Q(s, a)
       = δ    (12)

where the third equality uses Δ′ = Δ, which holds throughout training since both start at zero and, as (12) shows, the two tables receive identical updates.
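The equivalence established in (12) can also be verified numerically: training a shaped agent and a plain agent whose table is initialized as Q_0 = Q′_0 + Φ(s) on the same experience leaves the two tables differing by exactly Φ(s). The sketch below uses an assumed random MDP; its sizes, seed, and potential are arbitrary illustrative choices.

```python
import numpy as np

# A numerical sanity check of (12) on an assumed random MDP: Q-learning with
# the shaping reward gamma*Phi(s') - Phi(s) matches plain Q-learning whose
# table is initialized as Q_0(s, a) = Q'_0(s, a) + Phi(s).
rng = np.random.default_rng(0)
n_s, n_a, gamma, alpha = 6, 3, 0.9, 0.1
P = rng.integers(0, n_s, size=(n_s, n_a))     # deterministic transitions s' = P[s, a]
R = rng.random((n_s, n_a))                    # one-step rewards r_{s,a}
phi = rng.random(n_s)                         # an arbitrary potential Phi(s)

q_shaped = np.zeros((n_s, n_a))               # Q'_0 = 0, trained with shaping F
q_init = phi[:, None] + np.zeros((n_s, n_a))  # Q_0 = Q'_0 + Phi(s), no shaping

s = 0
for _ in range(5000):
    a = rng.integers(0, n_a)                  # identical experience for both agents
    s_next, r = P[s, a], R[s, a]
    F = gamma * phi[s_next] - phi[s]          # potential-based shaping reward (9)
    q_shaped[s, a] += alpha * (r + F + gamma * q_shaped[s_next].max() - q_shaped[s, a])
    q_init[s, a] += alpha * (r + gamma * q_init[s_next].max() - q_init[s, a])
    s = s_next

# The two tables differ by exactly Phi(s), so their greedy policies coincide.
assert np.allclose(q_init - q_shaped, phi[:, None])
```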
Therefore, we investigate the relationship between r_m and r_0(s, a) to decide the form of r_m. In the MDP problem setting [18], the discounted return from time step t is G_t = ∑_{k=0}^∞ γ^k r_{t+k+1}, and since γ ∈ (0, 1), if r_m is formed as a bounded series based on the distance from s to s_0, r_m(s, a) ≤ λ^k where λ ≤ 1, we have

    G_t = ∑_{k=0}^∞ γ^k r_{t+k+1}
        ≤ ∑_{k=0}^∞ γ^k λ^k
        ≤ ∑_{k=0}^∞ γ^k
        = 1/(1 − γ)    (13)

then for the optimal policy π* [18]

    Q*(s, a) = E_{π*}[G_t] ≤ 1/(1 − γ)    (14)

we know r_0 and r_m satisfy:

    1/(1 − γ) ≤ r_0    (15)

(15) gives an explicit gap between the two parts of r(s, a) and also directly influences the following initialization of Q_0(s, a).
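The geometric bound in (13) is easy to confirm numerically; the values of γ and λ below are illustrative choices satisfying γ ∈ (0, 1) and λ ≤ 1.

```python
# A quick numeric check of the geometric bound in (13); gamma and lam are
# illustrative values, not the paper's.
gamma, lam = 0.9, 0.8
G = sum(gamma**k * lam**k for k in range(10_000))  # discounted sum of r_m <= lam**k
assert G <= 1.0 / (1.0 - gamma)                    # G = 3.57... <= 10.0
```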
4.2.2 Q-initialization setting

We rewrite the initial Q-table of the policy π and the final converged table as Q_0 and Q*, respectively. By (7),

    Q(s, a) ← Q_0(s, a) + α(r_{s,a} + γQ*(s′, a′) − Q_0(s, a))
            = Q*(s, a) + (1 − α)(Q_0(s, a) − Q*(s, a))    (16)

since r_{s,a} + γQ*(s′, a′) = Q*(s, a) at convergence; we can derive that

    Q_0(s, a) > Q*(s, a) = r_{s,a}/(1 − γ)    (17)
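One way to act on (16) and (17) is to initialize the whole table above the converged value Q*, so that each update in (16) moves Q(s, a) toward Q* from above. The sketch below assumes a known upper bound r_max on the one-step reward; it is an illustration, not the paper's exact setting.

```python
import numpy as np

# A minimal sketch of the initialization suggested by (16) and (17); r_max is
# an assumed upper bound on the one-step reward.
n_states, n_actions, gamma = 30, 855, 0.9
r_max = 1.0
q_star_ceiling = r_max / (1.0 - gamma)    # upper bound on Q* from (17)
q_table = np.full((n_states, n_actions), q_star_ceiling + 1.0)  # Q_0 > Q*
```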