to guarantee the convergence of the model update. Moreover, under the Q-learning scheme, (17) always provides chances of exploration for actions that have not yet been attempted.

To this end, we give the reward signal for $(s, a)$ as follows:
$$
r_{s,a} :=
\begin{cases}
\dfrac{-0.1\,(d-2)}{2.8^{\,d+1}}, & d \ge 2,\\[6pt]
\dfrac{-0.01}{2.8^{\,d+1}}, & d = 1,\\[6pt]
\dfrac{0.2\,(30-t)}{2.8^{\,d+1}}, & d = 0,
\end{cases}
\tag{18}
$$
where $d$ measures how far the current CDF state is from the target state $s_0$ and $t$ is the step index within the episode. Here the condition $\ge$ means $d \ge 2$, and the penalty $-0.01/2.8^{\,d+1}$ is set to follow the conditions we derived in (13) and (15). Therefore, we can initialize the Q function as $Q := \mathbf{0}_{|\mathcal{S}|\times|\mathcal{A}|}$ to satisfy (17).
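As a minimal Python sketch of the reward shaping in (18) and the zero initialization above, assuming the distance $d$ and step index $t$ just described and placeholder table sizes:

import numpy as np

def reward(d, t):
    """Sketch of the reward signal r_{s,a} in (18).

    d -- assumed distance between the current CDF state and the target state
    t -- step index within the episode
    """
    scale = 2.8 ** (d + 1)             # common denominator 2.8^(d+1)
    if d >= 2:
        return -0.1 * (d - 2) / scale  # mild penalty while far from the target
    if d == 1:
        return -0.01 / scale           # small penalty one step away
    return 0.2 * (30 - t) / scale      # d == 0: bonus, larger when the target is reached early

# Zero-initialized Q table of size |S| x |A| to satisfy (17); the sizes are placeholders.
NUM_STATES, NUM_ACTIONS = 64, 16
Q = np.zeros((NUM_STATES, NUM_ACTIONS))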
Algorithm 1 Optimal Action Selection Control
Input: Initial CDF state $s$ and target state $s_0$.
Output: Optimal $a$ to approach $s_0$ during episode $e$.
1: Define the customized state space, action space, and reward signals.
2: Initialize $C := \{\}$, $Q := \mathbf{0}_{|\mathcal{S}|\times|\mathcal{A}|}$, $e := 0$
3: Initialize $s$ to the given initial CDF state, $t := 0$
4: repeat
5:   while $t < T$ do
6:     Update the exploration parameter $\epsilon$ for step $t$
7:     Sample $p_1, p_2 \sim U(0, 1)$
8:     if $p_1 \le \epsilon$ then
9:       if $p_2 > (1 - \epsilon)$ then
10:        Select $a' \in \mathcal{A} - C$, $a' = \arg\max_{a \in \mathcal{A} - C} Q(s, a)$
11:      else
12:        Select $a' \in \mathcal{A}$, $a' = \arg\max_{a \in \mathcal{A}} Q(s, a)$
         end
13:    else
14:      Select $a' \in \mathcal{A} - C$ randomly
       end
15:    Perform $a'$ in the simulator, obtain $s'$, $r(s, a')$
16:    Update the entry $Q(s, a')$ as in (7)
17:    $s \leftarrow s'$, $t \leftarrow t + 1$
18:    if $s' \ne s_0$ then
19:      Append $a'$ to $C$
20:    else
21:      Early stopping
22:      return
23:    end
24:  end while
25: until $s = s_0$; otherwise proceed to episode $e + 1$
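A minimal Python sketch of the double $\epsilon$-greedy selection step with controller $C$ (lines 7-14 of Algorithm 1), assuming integer-indexed actions, a single exploration parameter eps, and a guard against $C$ covering the whole action space; these are illustrative assumptions rather than the exact settings of Table 1:

import random
import numpy as np

def select_action(Q, s, C, eps):
    """Double epsilon-greedy selection with controller set C (cf. Algorithm 1, lines 7-14)."""
    all_actions = set(range(Q.shape[1]))
    allowed = sorted(all_actions - C) or sorted(all_actions)  # never let C empty the action set
    p1, p2 = random.random(), random.random()
    if p1 <= eps:
        if p2 > 1.0 - eps:
            # greedy choice restricted to the shrunken space A - C
            return max(allowed, key=lambda a: Q[s, a])
        # greedy choice over the full action space A
        return int(np.argmax(Q[s]))
    # exploration: uniformly random action from A - C
    return random.choice(allowed)

In Algorithm 1, $C$ then grows by one action whenever the visited state differs from $s_0$ (lines 18-19), so the allowed set $\mathcal{A} - C$ keeps shrinking within an episode.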
4.3 Dynamic Q-Learning algorithm

Considering the computational and equipment cost in an MMIMO system, the delaying effect of the reward should be minimized. Hence, after each step $t$ we use a twice $\epsilon$-greedy strategy, the controller $C$, to help avoid actions unrelated to $s_0$ and to dynamically shrink the action space, in order to make up for the delay in (18). Therefore, the controller plays a highly efficient role as the penalty signal in our reward and serves as a reinforced mechanism to assist the selection. The upper bound of the time complexity of the dynamic Q-learning method is given in [22].
For a total of at most $T$ trials over the training episodes with a fixed initial environment setting, Algorithm 1 stops training the agent once $s_0$ is approached rather than continuing the process, owing to the design of the reward signals in our model:
Controller C: As shown in Algorithm 1, controller $C$ shrinks the action space related to $s_0$ at every step based on the double $\epsilon$-greedy principle. This operation enables the optimal action to be selected with higher and higher probability as training goes on.
Reward $r_{s,a}$: (18) guarantees that the agent learns a global optimum, our target action, instead of continuously jumping among local optima for meaningless rewarding [23]. The reward signals and controller $C$ attempt to guide the agent by avoiding redundant scoring and long-term penalties, and the agent itself continuously updates its learning policy under the guidance of both of them.
4.4 Other existing methods

For the not-too-large $|\mathcal{S}| \times |\mathcal{A}|$ space defined in Section 4.1, the MC exhaustion algorithm often serves as a baseline solution for the problem in Section 3. It requires testing all possible $a \in \mathcal{A}$ to ensure the best action within the space $\mathcal{A}$.

Therefore, we apply classical model-free RL methods, Q-learning (off-policy) and Sarsa (on-policy) [18], to this problem setting. They differ mainly in the Q-function updating style: while Q-learning follows (7), Sarsa follows the update below:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha\big[\, r_{s,a} + \gamma\, Q(s', a') - Q(s, a) \,\big]
\tag{19}
$$
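A minimal Python sketch of the two tabular update rules, assuming a learning rate alpha and discount factor gamma as in Table 1; the function names are illustrative only:

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update in the style of (7): bootstrap on the greedy value at s_next."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update as in (19): bootstrap on the action a' actually taken at s_next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

The only difference is the bootstrap term: Q-learning evaluates the greedy action at the next state, whereas Sarsa evaluates the action the behavior policy actually takes there.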
Parameters for these models are set the same as in Table 1. Unsurprisingly, off-policy-based methods are superior to on-policy methods [18] in the experiment discussed later.

With the experience gained from Algorithm 1, Algorithm 2 is proposed to test the trained agent's policy with any randomized given initial state.

5. SIMULATIONS AND DISCUSSIONS

To thoroughly investigate the performance of the proposed RL-assisted full dynamic beamforming method and to validate the effectiveness of the preceding theoretical analysis, we present statistical results on the SINRs and the computational complexity of the proposed algorithm compared with other industrial methods. We implement Algorithm 1 within the environment described below, with the preset parameters shown in both Table 1 and Table 2.