percentage of received data, for all the applications, obtaining the final performance index value D. In order to observe the index after the action execution, we wait for ten seconds. Depending on the difference between the indexes evaluated before and after the execution of an action in the environment, we defined the reward as a number r that can assume one of the following values: [-1, -2/3, -1/3, 0, 1/3, 2/3, 1], where a value near one means that the action performed resulted in an improvement of the system performance, while a value near minus one means that the action performed resulted in a decrease of the system performance. For the sake of simplicity, we considered that UEs can only run one application at a time. Our goal is to produce an optimal policy that addresses the problem of user mobility inside the network in order to improve the user QoS.
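As a complement to this definition, the short sketch below shows one way the discretized reward could be computed from the performance index observed before and after an action. It is only an illustration: the text above specifies the set of admissible reward values, but not the thresholds on the index variation, so the thresholds and the function name used here are assumptions.

    import numpy as np

    # Discrete reward levels used by the Deep RL agent (from the text above).
    REWARD_LEVELS = np.array([-1.0, -2/3, -1/3, 0.0, 1/3, 2/3, 1.0])

    def compute_reward(d_before: float, d_after: float) -> float:
        """Map the variation of the performance index D (observed ten
        seconds after the action) onto one of the seven reward levels.
        The thresholds below are illustrative assumptions only."""
        delta = d_after - d_before
        thresholds = np.array([-0.15, -0.10, -0.05, 0.05, 0.10, 0.15])  # hypothetical
        return float(REWARD_LEVELS[int(np.searchsorted(thresholds, delta, side="right"))])

With these example thresholds, an index that drops from 0.60 to 0.52 would map to a reward of -1/3.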
With respect to the DNN we designed, here we sum up the main parameters in the following table:
DNN parameter                   Value
Number of hidden layers         3
Number of neurons               15
Input dimension                 21
Output dimension                9
Learning rate                   0.001
Activation function             ReLU
Update step                     50
Batch size                      32
Experience replay dimension     2000

Table 2 – Deep Neural Network parameters.
With reference to Table 2, by doing multiple tests we were able to establish that 3 hidden layers create a good topology, which is able to properly fit the desired output. Moreover, we fixed 15 neurons for each layer, a number that lies between the input layer dimension and the output one. With respect to the activation function, we used the Rectified Linear Unit (ReLU), which resulted in faster learning compared with other functions such as the sigmoid. Since our DNN has to predict the state Q-values, which are real-valued (defined in ℝ), the problem we tried to solve is a regression. For this reason, the cost function we used is the Mean Squared Error (MSE), which is typical for this kind of problem and is defined as

    MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2        (12)

where y_i is the real output and ŷ_i is the output predicted by the DNN.
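To make the network configuration concrete, the sketch below builds the Q-network of Table 2 with the TensorFlow/Keras Python API. The exact implementation is not shown in the paper, so this is only an illustrative reconstruction; in particular, the choice of the Adam optimizer is an assumption, as only the learning rate of 0.001 is specified.

    import tensorflow as tf

    def build_q_network() -> tf.keras.Model:
        """Q-value DNN with the parameters of Table 2: 21 state features in,
        one Q-value per action out (9 actions), three hidden ReLU layers of
        15 neurons each, and the MSE loss of Eq. (12). Optimizer is assumed."""
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(21,)),             # input dimension
            tf.keras.layers.Dense(15, activation="relu"),   # hidden layer 1
            tf.keras.layers.Dense(15, activation="relu"),   # hidden layer 2
            tf.keras.layers.Dense(15, activation="relu"),   # hidden layer 3
            tf.keras.layers.Dense(9, activation="linear"),  # output dimension (Q-values)
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                      loss="mse")
        return model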
Regarding the update step, the batch size, and the experience replay dimension, the values we set were obtained empirically by trying different values.
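The following sketch illustrates how these three empirically chosen values would typically be used in a DQN-style training loop: transitions are stored in a bounded replay memory of 2000 entries, minibatches of 32 are sampled to fit the network on MSE targets, and the network used to bootstrap the targets is refreshed every 50 steps. The loop itself, the use of a separate target network, and the discount factor are not detailed in the paper, so they are assumptions made only for illustration.

    import random
    from collections import deque

    import numpy as np

    REPLAY_CAPACITY = 2000   # experience replay dimension (Table 2)
    BATCH_SIZE = 32          # batch size (Table 2)
    UPDATE_STEP = 50         # update step (Table 2)
    GAMMA = 0.95             # discount factor: assumed, not given in the paper

    replay_buffer = deque(maxlen=REPLAY_CAPACITY)

    def store_transition(state, action, reward, next_state):
        """Append one experience tuple; old entries are evicted automatically."""
        replay_buffer.append((state, action, reward, next_state))

    def train_step(q_net, target_net, step):
        """Sample a minibatch, regress the Q-network towards the bootstrapped
        targets (MSE loss), and periodically refresh the target network."""
        if len(replay_buffer) < BATCH_SIZE:
            return
        batch = random.sample(replay_buffer, BATCH_SIZE)
        states = np.array([b[0] for b in batch], dtype=np.float32)
        next_states = np.array([b[3] for b in batch], dtype=np.float32)
        targets = q_net.predict(states, verbose=0)
        next_q = target_net.predict(next_states, verbose=0)
        for i, (_, action, reward, _) in enumerate(batch):
            targets[i, action] = reward + GAMMA * np.max(next_q[i])
        q_net.fit(states, targets, batch_size=BATCH_SIZE, epochs=1, verbose=0)
        if step % UPDATE_STEP == 0:
            target_net.set_weights(q_net.get_weights())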
In Fig. 5 we show a comparison between the policy learned by our Deep RL algorithm after training for 25000 seconds of simulation and a scenario without any policy, where we simply distributed one application for each MEC server. For a fair comparison, we used the same random seed in order to maintain the same user mobility pattern in both simulations. The plots show that the Deep RL algorithm is able to improve the overall system performance: except for a short period between 100 and 200 seconds, where the Deep RL algorithm encounters a small decrease (mainly due to the stochasticity of the environment), the results are in general good, reaching an average of 0.60, which compares favourably with the no-policy average of 0.54. As we are writing, we are trying to extend the training time with the aim of further improving the obtained results.

Figure 5 – Comparison between the performance obtained by the Deep RL policy and a scenario where the data migration is not enabled (percentage of received data D versus simulation time in seconds).

6. CONCLUSIONS

In this paper, we presented a deep reinforcement learning approach to address the problem related to network environment dynamics. We designed a Deep RL algorithm and tested it in a real scenario, demonstrating the feasibility of the technique. Future work will be devoted to implementing a better integration with the OMNeT++ environment by using the TensorFlow C++ frontend, to comparing with other solutions, to using more realistic traffic and mobility models, and to investigating new indexes with the aim of further improving the system performance.