Figure 1 – The MEC-enabled LTE scenario.


its ability to learn through a trial-and-error process, which closely resembles the way human beings learn, makes it the best choice for solving decision-making problems. As its basic model, RL adopts the MDP formalism, a framework for modeling decision making in stochastic environments. From a mathematical point of view, an MDP is defined as follows:

• a set of states S that the environment can assume;

• a set of actions A that the agent can perform;

• a probability transition matrix P;

• a reward function R : S → ℝ;

• a discount factor γ ∈ [0, 1], which defines the importance we give to future rewards.
In such a context, we introduce the RL agent, whose objective is to find an optimal policy that maximizes the reward function in each state of the environment. Using the MDP definition, the agent is able to run across the environment states several times and change the system policy to improve the reward accordingly. The Bellman equation expresses the relationship between the utility of a state and that of its neighbors [9]:

U(s) = R(s) + γ · max_{a∈A} Σ_{s'} P(s'|s, a) U(s')        (1)

where U(s) is the utility of state s, i.e. the immediate reward R(s) plus the discounted expected utility of the successor state, assuming that the agent will always choose the optimal action.
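For concreteness, eq. (1) can be applied iteratively (value iteration) until the utilities converge. The sketch below illustrates this on a toy two-state, two-action MDP; the transition matrix, rewards, and variable names are our own illustrative assumptions, not values taken from the paper.

    import numpy as np

    # Toy MDP, purely illustrative: 2 states, 2 actions.
    # P[a, s, s'] = transition probability, R[s] = immediate reward.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    R = np.array([0.0, 1.0])
    gamma = 0.9          # discount factor

    U = np.zeros(2)      # utilities U(s), initialised to zero
    for _ in range(1000):
        # Bellman update, eq. (1): U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
        U_new = R + gamma * np.max(np.einsum('ast,t->as', P, U), axis=0)
        if np.max(np.abs(U_new - U)) < 1e-6:   # stop once the values have converged
            U = U_new
            break
        U = U_new
    print(U)             # estimated utility of each state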
When the probability transition matrix is not known, a typical RL approach is Q-learning [9], a model-free technique which tries to learn the relationship between the execution of an action in a given state and the associated reward or utility through the concept of the Q-value Q(s, a), which returns the value of taking action a when the agent is in state s:

Q(s, a) = Q(s, a) + α · (R(s) + γ · max_{a'} Q(s', a') − Q(s, a))        (2)

U(s) = max_a Q(s, a)        (3)

where α is the learning rate and γ is the discount factor.
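As a hedged illustration of how the update rules in eqs. (2)-(3) are typically implemented, the sketch below performs tabular Q-learning with ε-greedy exploration; the environment interface (reset(), step(), n_actions) and all parameter values are our own assumptions and do not come from the paper.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
        """Tabular Q-learning. env is assumed to expose reset() -> s and
        step(a) -> (s_next, reward, done); this interface is hypothetical."""
        Q = defaultdict(float)                    # Q[(s, a)], zero-initialised
        actions = list(range(env.n_actions))      # assumed discrete action set A
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy trade-off between exploration and exploitation
                a = random.choice(actions) if random.random() < eps \
                    else max(actions, key=lambda a_: Q[(s, a_)])
                s_next, r, done = env.step(a)
                # eq. (2): Q(s,a) = Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))
                best_next = max(Q[(s_next, a_)] for a_ in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        # eq. (3): U(s) = max_a Q(s, a)
        utility = lambda state: max(Q[(state, a_)] for a_ in actions)
        return Q, utility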
When the environment has a large number of states, it is impossible to use traditional Q-learning. In fact, the time to converge increases as the state space becomes larger, so in a scenario with a huge number of states (e.g. 10^20 or even more) the agent cannot realistically visit all of them multiple times to learn a good policy [9]. One way to address this kind of problem is quantization, which consists in grouping a set of states into a single one, thus reducing the number of states that describe the environment. However, such a technique is not helpful in those cases where the state space reduction would result in too large a quantization error, which could lead the agent to learn a wrong policy.
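To make the quantization idea concrete, the snippet below aggregates a continuous observation (e.g. a normalized load level) into a small set of discrete states; the bin edges are arbitrary values chosen only for illustration, not a scheme proposed by the paper.

    import numpy as np

    # Illustrative only: 10 aggregated states for a continuous value in [0, 1].
    BIN_EDGES = np.linspace(0.0, 1.0, num=11)

    def quantize(load):
        """Map a continuous observation to the index of its aggregated state."""
        return int(np.clip(np.digitize(load, BIN_EDGES) - 1, 0, len(BIN_EDGES) - 2))

    # Nearby observations collapse into the same state: the Q-table shrinks,
    # but the agent can no longer distinguish them (quantization error).
    assert quantize(0.42) == quantize(0.44)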
A very interesting alternative is the use of a Deep Neural Network (DNN) as a function approximator capable of predicting the Q-values for a given state without explicitly using eqs. (2)-(3) and the corresponding MDP. Deep Reinforcement Learning is a technique pioneered by DeepMind [10]. The idea comes from the need for a new way to represent complex environments, where the dimensions of the state space and of the action space are so large that they cannot be solved with traditional approaches. The key idea at the base of deep reinforcement learning is the use of two separate DNNs, parameterized with θ and θ̂ respectively: the Main DNN, used to predict the Q-values associated with a generic state, and the Target DNN, used to generate the target



