Figure 1 – The MEC-enabled LTE scenario.


its ability to learn through a trial-and-error process, which closely resembles the way human beings learn, makes it the best choice for solving decision-making problems. As its basic model, RL adopts the MDP formalism, a framework for modeling decision making in stochastic environments. From a mathematical point of view, an MDP is defined as follows:

• a set of states S that the environment can assume;

• a set of actions A that the agent can perform;

• a probability transition matrix P;

• a reward function R : S → ℝ;

• a discount factor γ ∈ [0, 1], which defines the importance we give to future rewards.
In such a context, we introduce the RL agent, whose objective is to find an optimal policy that maximizes the reward function in each state of the environment. Using the MDP definition, the agent is able to run across the environment states several times and change the system policy to improve the reward accordingly. The Bellman equation expresses the relationship between the utility of a state and that of its neighbors [9]:

U(s) = R(s) + γ · max_{a∈A} Σ_{s'} P(s'|s, a) U(s')        (1)

where U(s) is the utility of state s, i.e. the immediate reward R(s) plus the discounted expected utility of the successor state, assuming that the agent will always choose the optimal action.
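For concreteness, eq. (1) can be applied iteratively (value iteration) until the utilities converge. The sketch below illustrates this on a toy two-state, two-action MDP; the transition matrix, rewards, and variable names are our own illustrative assumptions, not values taken from the paper.

    import numpy as np

    # Toy MDP, purely illustrative: 2 states, 2 actions.
    # P[a, s, s'] = transition probability, R[s] = immediate reward.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    R = np.array([0.0, 1.0])
    gamma = 0.9          # discount factor

    U = np.zeros(2)      # utilities U(s), initialised to zero
    for _ in range(1000):
        # Bellman update, eq. (1): U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
        U_new = R + gamma * np.max(np.einsum('ast,t->as', P, U), axis=0)
        if np.max(np.abs(U_new - U)) < 1e-6:   # stop once the values have converged
            U = U_new
            break
        U = U_new
    print(U)             # estimated utility of each state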
When the probability transition matrix is not known, a typical RL approach is Q-learning [9], a model-free technique which tries to learn the relationship between the execution of an action in a given state and the associated reward or utility through the concept of the Q-value Q(s, a), which returns the value of taking action a when the agent is in state s:

Q(s, a) = Q(s, a) + α · (R(s) + γ · max_{a'} Q(s', a') − Q(s, a))        (2)

U(s) = max_a Q(s, a)        (3)

where α is the learning rate and γ is the discount factor.
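As a hedged illustration of how the update rules in eqs. (2)-(3) are typically implemented, the sketch below performs tabular Q-learning with ε-greedy exploration; the environment interface (reset(), step(), n_actions) and all parameter values are our own assumptions and do not come from the paper.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
        """Tabular Q-learning. env is assumed to expose reset() -> s and
        step(a) -> (s_next, reward, done); this interface is hypothetical."""
        Q = defaultdict(float)                    # Q[(s, a)], zero-initialised
        actions = list(range(env.n_actions))      # assumed discrete action set A
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy trade-off between exploration and exploitation
                a = random.choice(actions) if random.random() < eps \
                    else max(actions, key=lambda a_: Q[(s, a_)])
                s_next, r, done = env.step(a)
                # eq. (2): Q(s,a) = Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))
                best_next = max(Q[(s_next, a_)] for a_ in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        # eq. (3): U(s) = max_a Q(s, a)
        utility = lambda state: max(Q[(state, a_)] for a_ in actions)
        return Q, utility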
When the environment has a large number of states, it is impossible to use traditional Q-learning. In fact, the time to converge increases as the state space becomes larger, so in a scenario with a huge number of states (e.g. 10^20 or even more) the agent cannot realistically visit all of them multiple times to learn a good policy [9]. One way to address this kind of problem is quantization, which consists in grouping a set of states into a single one, thus reducing the number of states that describe the environment. However, such a technique is not helpful in those cases where the state space reduction would result in too large a quantization error, which could lead the agent to learn a wrong policy.
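To make the quantization idea concrete, the snippet below aggregates a continuous observation (e.g. a normalized load level) into a small set of discrete states; the bin edges are arbitrary values chosen only for illustration, not a scheme proposed by the paper.

    import numpy as np

    # Illustrative only: 10 aggregated states for a continuous value in [0, 1].
    BIN_EDGES = np.linspace(0.0, 1.0, num=11)

    def quantize(load):
        """Map a continuous observation to the index of its aggregated state."""
        return int(np.clip(np.digitize(load, BIN_EDGES) - 1, 0, len(BIN_EDGES) - 2))

    # Nearby observations collapse into the same state: the Q-table shrinks,
    # but the agent can no longer distinguish them (quantization error).
    assert quantize(0.42) == quantize(0.44)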
A very interesting alternative is the use of a Deep Neural Network (DNN) as a function approximator capable of predicting the Q-values for a given state without explicitly using eqs. (2)-(3) and the corresponding MDP. Deep Reinforcement Learning is a technique pioneered by DeepMind [10]. The idea comes from the need for a new way to represent complex environments, where the dimensions of the state space and of the action space are so large that they cannot be solved with traditional approaches. The key idea at the base of deep reinforcement learning is the use of two separate DNNs, parameterized with θ and θ̂ respectively: the Main DNN, used to predict the Q-values associated with a generic state, and the Target DNN, used to generate the target



