           Q values in the update rule which can be expressed as follows:

        y_{j,a} = r_j + γ max(Q̂(s_{j+1}, θ̂))                     (4)

where y_{j,a} is the Q-value associated with state j when the agent performs action a, r_j is the reward associated with state s_j, and Q̂(s_{j+1}, θ̂) is the output layer of the target DNN, which contains the Q-value estimates for state s_{j+1}. By using two networks instead of just one, training becomes more stable: it is not affected by the learning loops that self-updated network targets can introduce, which sometimes lead to oscillations or policy divergence.
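To make Eq. (4) concrete, the following sketch (a minimal illustration, not the authors' implementation; the numpy arrays standing in for the network outputs are assumptions) computes the target values for a batch of transitions, using the frozen target network Q̂(·, θ̂) only for the bootstrap term:

    import numpy as np

    def td_targets(rewards, q_next_target, gamma=0.99):
        # y_{j,a} = r_j + gamma * max(Q_hat(s_{j+1}, theta_hat)), Eq. (4)
        # rewards:        shape (B,),   rewards r_j for a batch of B transitions
        # q_next_target:  shape (B, Z), output layer of the *target* DNN for s_{j+1}
        return rewards + gamma * q_next_target.max(axis=1)

    # hypothetical batch of 3 transitions with Z = 4 possible actions
    rewards = np.array([1.0, 0.5, -0.2])
    q_next_target = np.random.randn(3, 4)   # stand-in for the target network output
    print(td_targets(rewards, q_next_target))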

Algorithm 1: Deep RL
 1 initialize the experience replay memory E to {}
 2 randomly initialize the main DNN weights θ
 3 set the target DNN weights θ̂ equal to θ
 4 set the discount factor γ
 5 set the batch size
 6 set the update step U
 7 set the waiting time t
 8 set the exploration rate ε
 9 set the decay rate d
10 for episode = 1 to end:
11   observe the current state s_j
12   p = random([0, 1])
13   if ε > p:
14     action = random([1, Z])
15   else:
16     action = argmax(Q(s_j, θ))
17   end if
18   execute the action
19   wait(t seconds)
20   observe the new state s_{j+1}
21   observe the reward r
22   store the tuple (s_j, action, s_{j+1}, r) in E
23   sample a batch from E
24   y = Q(s_j, θ), y_target = Q̂(s_{j+1}, θ̂)
25   y_action = r + γ · max(y_target)
26   execute one training step on the main DNN
27   decay the exploration rate ε with d
28   every U steps set θ̂ = θ
29 end for

3.2  Problem Formulation

In order to properly design a deep RL algorithm, it is first of all necessary to define the environment in which the RL agent will operate. In this subsection, we formalize the MEC scenario we are interested in.

We start by defining N as the number of eNBs present in the MEC-enabled LTE network. For the sake of simplicity and without loss of generality, let us assume that one and only one MEC server is attached to each eNB. Then, let us define the set, with cardinality N ∈ ℕ, of the MEC servers attached to the eNBs as:

        MEC = [MEC_1, MEC_2, MEC_3, ..., MEC_N]                    (5)

Users are free to move around the MEC environment and attach to different eNBs through seamless handover procedures. Moreover, they run several applications whose data is contained in the Cloud or inside one of the MEC servers. Regarding the applications that users can run and the devices attached to the eNBs, we can define the following sets:

        Apps = [app_1, app_2, app_3, ..., app_M]                   (6)

        Devices = [UE_1, UE_2, UE_3, ..., UE_K]                    (7)

with M, K ∈ ℕ. With respect to the actions that the agent can perform in the environment, let us define the set of actions with cardinality Z ∈ ℕ as:

        Actions = [a_1, a_2, a_3, ..., a_Z]                        (8)
where each element represents an app migration from the Cloud towards one of the servers defined in the MEC set, from one MEC server to the Cloud, or among MEC servers. Another important element that we have to introduce is the concept of state, which consists of a tuple containing all the information related to the users' positions and the app distribution over the network, and which is defined as follows:


        s = < (s_1, s_2, s_3, ..., s_T) >                          (9)
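The paper leaves the concrete encoding of this tuple open; purely as an illustration (the field names, the one-hot placement encoding, and the helper encode_state below are assumptions, not the authors' design), the state could concatenate the number of UEs attached to each eNB with a one-hot indicator of the node currently hosting each app:

    import numpy as np

    def encode_state(ues_per_enb, app_locations, n_nodes):
        # ues_per_enb:   length-N list, UEs currently attached to each eNB
        # app_locations: length-M list, hosting node of each app
        #                (0 = Cloud, 1..N = index of the MEC server)
        # n_nodes:       N + 1 possible hosting nodes
        placement = np.zeros((len(app_locations), n_nodes))
        placement[np.arange(len(app_locations)), app_locations] = 1.0  # one-hot per app
        return np.concatenate([np.asarray(ues_per_enb, dtype=float), placement.ravel()])

    # N = 3 eNBs/MEC servers, M = 2 apps: app_1 hosted on MEC_2, app_2 in the Cloud
    s = encode_state(ues_per_enb=[4, 1, 7], app_locations=[2, 0], n_nodes=4)
    print(s.shape)   # (3 + 2*4,) = (11,)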
           Finally, with respect to the reward, we define it as a number
           r ∈ R that is computed as a combination of several network
           performance indexes.
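Putting the pieces together, the loop below is a compact sketch of Algorithm 1 under simplifying assumptions (a linear Q-function trained by plain gradient descent stands in for the DNN, and toy_env_step is a made-up placeholder for the real MEC environment): it selects migration actions ε-greedily, stores transitions in the replay memory E, regresses the main weights θ towards the targets of Eq. (4) computed with θ̂, and copies θ into θ̂ every U steps.

    import random
    import numpy as np

    STATE_DIM, Z = 11, 4                    # toy state size and number of migration actions
    GAMMA, ALPHA, BATCH, U = 0.9, 0.01, 32, 50
    eps, decay = 1.0, 0.995                 # exploration rate and its decay rate d

    theta = np.zeros((STATE_DIM, Z))        # main network weights (linear Q for brevity)
    theta_hat = theta.copy()                # target network weights
    E = []                                  # experience replay memory

    def Q(s, w):                            # Q-values of all Z actions in state s
        return s @ w

    def toy_env_step(action):               # placeholder for the real MEC environment
        return np.random.rand(STATE_DIM), np.random.rand() - 0.5   # s_{j+1}, reward r

    s = np.random.rand(STATE_DIM)
    for step in range(1, 1001):
        # epsilon-greedy action selection (Algorithm 1, lines 12-17)
        action = random.randrange(Z) if random.random() < eps else int(np.argmax(Q(s, theta)))
        s_next, r = toy_env_step(action)                   # execute the action, observe s_{j+1} and r
        E.append((s, action, r, s_next))                   # store the tuple in E
        if len(E) >= BATCH:
            for sj, a, rj, sj1 in random.sample(E, BATCH): # sample a batch from E
                y_action = rj + GAMMA * np.max(Q(sj1, theta_hat))  # Eq. (4), target network
                td_error = Q(sj, theta)[a] - y_action
                theta[:, a] -= ALPHA * td_error * sj       # one SGD step on the main network
        eps *= decay                                       # decay the exploration rate
        if step % U == 0:
            theta_hat = theta.copy()                       # every U steps set theta_hat = theta
        s = s_next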



