where α is the path‑loss exponent, P_0 is the transmit power of the serving eNodeB, N is the number of neighboring eNodeBs, P_i is the transmit power from neighboring eNodeB i, d_0 is the distance of the UE to the serving station, d_i is the distance of the UE to each of the neighboring stations, and σ² is the background noise determined by the thermal noise and the system bandwidth.
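As a rough illustration of how these quantities combine, the sketch below assumes that (4) takes the conventional downlink SINR form (serving power over interference plus noise); since (4) itself is not reproduced in this section, the function, its parameter names and all numbers are illustrative assumptions, not values from the paper.

```python
import math

# Illustrative sketch only: assumes (4) has the conventional downlink SINR form
# SINR = P0 * d0^(-alpha) / (sum_i Pi * di^(-alpha) + sigma^2),
# where sigma^2 is set by the thermal noise density and the system bandwidth.
def downlink_sinr(p_serving, d_serving, p_neighbors, d_neighbors, alpha, n0, bandwidth):
    """SINR of a UE at distance d_serving from its serving eNodeB."""
    signal = p_serving * d_serving ** (-alpha)
    interference = sum(p * d ** (-alpha) for p, d in zip(p_neighbors, d_neighbors))
    noise = n0 * bandwidth                  # sigma^2 = thermal noise density * bandwidth
    return signal / (interference + noise)

# Example with made-up values: one serving cell, three interfering neighbors.
sinr = downlink_sinr(10.0, 120.0, [10.0, 10.0, 10.0], [300.0, 450.0, 500.0],
                     alpha=3.5, n0=4e-21, bandwidth=20e6)
print(f"SINR = {sinr:.1f} ({10 * math.log10(sinr):.1f} dB)")
```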
According to [16], the UE density λ(t) in (4) is assumed to follow an independently and identically distributed two‑dimensional Poisson point process. The number of users N_0(t) of the target cell with area A_0 is given by

P{ N_0(t) = n | A_0 } = ((λ(t)A_0)^n / n!) e^(−λ(t)A_0)      (6)
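For concreteness, a minimal sketch of evaluating (6): given an estimated UE density λ(t) and cell area A_0, the probability of observing n users is the Poisson probability with mean λ(t)·A_0. The variable names and the example numbers are illustrative, not taken from the paper.

```python
import math

def prob_num_users(n, ue_density, cell_area):
    """P{N0(t) = n | A0} from (6): Poisson pmf with mean lambda(t) * A0."""
    mean = ue_density * cell_area
    return (mean ** n) / math.factorial(n) * math.exp(-mean)

# Example with made-up values: density 0.002 users/m^2 over 10,000 m^2 -> mean of 20 users.
for n in (10, 20, 30):
    print(n, round(prob_num_users(n, ue_density=0.002, cell_area=10_000.0), 4))
```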
From (4) to (6) the optimal parameter pair can be found, and the best weight Π can be derived by substituting (4)‑(6) into (1)‑(3).

4. THE PROPOSED REINFORCEMENT LEARNING ASSISTED BEAMFORMING
Due to the lack of prior knowledge that is required to find the theoretical optimal solution of (4), some research has been conducted on related surrogate optimization problems. Generally, reinforcement learning methods such as Sarsa [17] and Q‑learning [13] have been attempted. Those methods lack convergence efficiency in practice even though their convergence can be guaranteed [18].
Q‑learning beamforming
A Q‑learning beamforming method is proposed to mitigate ICI and enhance convergence efficiency for users in a dense Urban‑eMBB transmission environment. It estimates the Probability Density Function (PDF) of users' occurrences to achieve an optimal beamforming solution via trials, without knowledge of the network and transmission channel.

RL‑based beamforming
In the RL‑based beamforming process shown in Fig. 1(a)(b), the BS in the target cell estimates the probability density of users' occurrences in the target small cell #0 by a long‑term data statistical analysis in (6) at time slot t. Once all served users send their SINRs γ(t−1) at the time slot (t−1) to the BS, the state [λ(t), γ(t−1)] observed by the BS at the time slot t is obtained, and an RL‑based beamforming algorithm is then applied to search for the optimal parameters for ICI mitigation and coverage optimization. We formulate the beamforming optimization problem under the MMIMO system context as an RL problem and therefore provide a dynamic Q‑learning scheme to address the issue.
First, we define the agent to be the MMIMO system, the set of states S ≜ {s_i : i = 0, …, |S| − 1} to be the levels of average regional SINR, and the set of actions A ≜ {a_j : j = 0, …, |A| − 1} to be the possible combinations of antenna parameters. More precisely, each s_i is an interval of the SINR value; s_0 is the optimal SINR value interval, i.e. the highest achievable SINR value derived from expert experience in the current environment. Similarly, s_{|S|−1} is the lowest SINR range, and as i increases the boundary values of s_i decrease, so a higher i implies a poorer signal performance state s_i. Each action a is an antenna parameter choice made by the MMIMO system and consists of azimuth, vertical angle and beam width. The environment is a signal simulator; see Section 5.1 for more detail. The objective is to approach the optimal target SINR state s_0 to achieve the best signal performance; it covers the probability in (4) of the average regional SINR given by the simulator and guided by the selected action a. The environment (Fig. 1(c)) grants the agent a reward r_{s,a} after the latter takes an a ∈ A when it is in s ∈ S.
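A minimal sketch of how such a discrete state and action space might be represented: states index SINR intervals from best (index 0) to worst, and actions enumerate combinations of azimuth, vertical angle and beam width. The interval boundaries and candidate parameter grids below are illustrative assumptions, not the paper's actual values.

```python
from itertools import product

# States: index 0 is the best (highest) SINR interval, the last index the worst.
# Boundaries are made-up dB values; the paper derives its intervals from expert experience.
SINR_BOUNDS_DB = [20.0, 15.0, 10.0, 5.0, 0.0]      # defines |S| = 6 intervals

def sinr_to_state(sinr_db):
    """Map a measured average regional SINR to a state index (higher index = worse)."""
    for i, bound in enumerate(SINR_BOUNDS_DB):
        if sinr_db >= bound:
            return i
    return len(SINR_BOUNDS_DB)                      # below the lowest boundary

# Actions: every combination of azimuth, vertical angle (tilt) and beam width.
AZIMUTHS = [-30, 0, 30]          # degrees (illustrative)
TILTS = [2, 6, 10]               # degrees (illustrative)
BEAM_WIDTHS = [30, 65]           # degrees (illustrative)
ACTIONS = list(product(AZIMUTHS, TILTS, BEAM_WIDTHS))    # |A| = 18 combinations

print(sinr_to_state(12.3), len(ACTIONS))   # -> state 2, 18 actions
```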
Formally, we denote the state‑action value function, i.e. the expected discounted reward, as Q(s, a). In the table Q ∈ ℝ^{|S|×|A|}, we use the notation [19] Q(s, a) ≜ [Q]_{s,a} and update entries by

Q(s, a) ← Q(s, a) + α [ r_{s,a} + γ max_{a′} Q(s′, a′) − Q(s, a) ]      (7)

where α : 0 < α < 1 is the learning rate, and γ : 0 < γ < 1 is the discount factor, which determines the importance of future rewards. s′ and a′ are the next state and action, respectively.
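Update (7) is the standard tabular Q‑learning rule. A minimal sketch, assuming the discrete state/action indexing above and illustrative values for the learning rate and discount factor:

```python
import numpy as np

NUM_STATES, NUM_ACTIONS = 6, 18        # |S| and |A|; illustrative sizes
ALPHA, GAMMA = 0.1, 0.9                # learning rate and discount factor (assumed values)

Q = np.zeros((NUM_STATES, NUM_ACTIONS))   # the table Q in R^{|S| x |A|}

def q_update(s, a, reward, s_next):
    """One application of update rule (7)."""
    td_target = reward + GAMMA * Q[s_next].max()   # r + gamma * max_a' Q(s', a')
    Q[s, a] += ALPHA * (td_target - Q[s, a])

# Example transition: in state 3, action 7 yielded reward 1.0 and led to state 2.
q_update(s=3, a=7, reward=1.0, s_next=2)
```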
An episode is a period of time in which an interaction between the environment and the agent takes place. Here, an episode consists of (at most) T transitional discrete time steps t. During an episode, t ∈ {0, 1, …, T}, the agent makes decisions to maximize the effect of the actions it selects. To achieve this goal, we apply the ε‑greedy learning strategy to balance exploration and exploitation, where 1 − ε : 0 < ε < 1 is the exploration rate and serves as the threshold probability to select a random a ∈ A, as opposed to selecting an action based on exploitation. To add randomness, ε increases in every episode until it reaches a preset upper bound.
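A minimal sketch of the ε‑greedy selection and schedule described above, following the convention stated here that 1 − ε is the probability of picking a random action and that ε grows each episode toward a preset upper bound; the starting value, increment and bound are illustrative assumptions.

```python
import random
import numpy as np

EPS_START, EPS_STEP, EPS_MAX = 0.5, 0.02, 0.95   # assumed schedule parameters

def select_action(Q, s, eps):
    """epsilon-greedy: with probability (1 - eps) explore, otherwise exploit Q (ndarray)."""
    if random.random() < 1.0 - eps:               # exploration branch: random a in A
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))                   # exploitation: best known action in state s

def next_epsilon(eps):
    """Increase eps every episode until it reaches the preset upper bound."""
    return min(eps + EPS_STEP, EPS_MAX)
```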
The S space is constructed by partitioning the range of the process Cumulative Distribution Function (CDF), which is the probability of users with SINR below the value given in (4). The components of the S space are shown in Table 1. Through a finite series of actions a ∈ A (to be discussed later), the agent attempts to approach s_0 in response to the simulated γ at step t within an episode.
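As one possible reading of this construction, the sketch below partitions the CDF range [0, 1] into equal sub‑intervals and maps an observed CDF value (the fraction of users whose SINR falls below a reference value) to a state index. The number of intervals, the reference SINR and the ordering convention are illustrative assumptions; the actual partition used in the paper is the one listed in Table 1.

```python
import numpy as np

NUM_STATE_INTERVALS = 6          # assumed |S|; the paper's partition is given in Table 1
SINR_REF_DB = 5.0                # assumed reference SINR value

def cdf_value(user_sinrs_db, ref_db=SINR_REF_DB):
    """Empirical CDF at ref_db: fraction of users whose SINR is below the reference."""
    sinrs = np.asarray(user_sinrs_db, dtype=float)
    return float((sinrs < ref_db).mean())

def cdf_to_state(p):
    """Partition [0, 1] into equal intervals; a smaller fraction of poorly served
    users maps to a better (lower-index) state."""
    idx = int(p * NUM_STATE_INTERVALS)
    return min(idx, NUM_STATE_INTERVALS - 1)

state = cdf_to_state(cdf_value([3.1, 7.4, 12.0, -1.2, 9.8]))   # made-up UE SINRs -> state 2
```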