continuously interacting with a training environment and updating the NN parameters as a result of these interactions. The training environment considered here is a network simulator that mimics the behavior of the real network when varying the offered load of the different slices in the different cells and when modifying the resource usage quota allocated to each slice as a result of the actions made by the DQN agents. In this respect, the simulator is fed by training data consisting of multiple time patterns of the required capacity (i.e., offered load) of the slices in the different cells. This data can be either built synthetically or extracted from real network measurements. The training is assumed to be executed in a training host, located at the SMO, with the necessary libraries, supporting tools and computational capabilities for training the DQN models and running the simulator.
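For illustration purposes, such synthetic training inputs could be generated along the following lines; the sinusoidal daily profile, the parameter names and the default numbers of slices, cells and time steps are assumptions made only for this sketch and are not taken from the present work.

```python
import numpy as np

def synthetic_offered_load(num_slices=2, num_cells=4, num_steps=288,
                           peak_mbps=100.0, seed=0):
    """Build illustrative offered-load time patterns O(k, n) per slice and cell.

    Returns an array of shape (num_steps, num_slices, num_cells) in Mb/s.
    A daily sinusoidal profile plus random noise is assumed here purely as an
    example; real deployments would instead use measured traffic traces.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(num_steps)
    # One "day" of num_steps time steps with a sinusoidal busy-hour pattern.
    daily_profile = 0.5 + 0.5 * np.sin(2 * np.pi * t / num_steps - np.pi / 2)
    load = np.empty((num_steps, num_slices, num_cells))
    for k in range(num_slices):
        for n in range(num_cells):
            scale = peak_mbps * rng.uniform(0.5, 1.0)   # per-slice/cell peak
            noise = rng.normal(0.0, 0.05 * peak_mbps, size=num_steps)
            load[:, k, n] = np.clip(scale * daily_profile + noise, 0.0, None)
    return load
```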
For carrying out the training process, each DQN agent is composed of three different elements: (i) The evaluation NN, which corresponds to the function Qk(s(k), a(k), θk) being learnt and will eventually determine the policy to be applied at the ML inference host. (ii) The target NN, which is another NN with the same structure as the evaluation NN but with weights θk⁻. It is used for obtaining the so-called Temporal Difference (TD) target required for updating the evaluation NN. (iii) The experience data set (ED), which stores the experiences of the agent resulting from the interactions with the training environment, as explained in the following.
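As an illustration of how these three elements could be organized, the following sketch outlines a per-slice agent with an evaluation NN, a target NN of identical structure and an experience data set; it assumes a PyTorch-style implementation, and the class name, layer sizes, discount factor and learning rate are illustrative choices rather than the configuration used in this work. The TD target is computed with the target NN, as indicated in element (ii).

```python
from collections import deque

import torch
import torch.nn as nn

class SliceDQNAgent:
    """Illustrative per-slice DQN agent with the three elements (i)-(iii):
    evaluation NN, target NN and experience data set (ED)."""

    def __init__(self, state_dim, num_actions, ed_capacity=100_000, lr=1e-3):
        # (i) Evaluation NN approximating Qk(s(k), a(k), θk).
        self.eval_nn = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )
        # (ii) Target NN: same structure, separate weights θk⁻,
        #      initialized as a copy of the evaluation NN.
        self.target_nn = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )
        self.target_nn.load_state_dict(self.eval_nn.state_dict())
        # (iii) Experience data set (ED) storing the interaction tuples.
        self.ed = deque(maxlen=ed_capacity)
        self.optimizer = torch.optim.Adam(self.eval_nn.parameters(), lr=lr)

    def td_target(self, reward, next_state, gamma=0.95):
        """TD target computed with the target NN, used to update the evaluation NN."""
        with torch.no_grad():
            return reward + gamma * self.target_nn(next_state).max(dim=-1).values
```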
The interactions between the DQN agent and the training environment occur in time steps of (simulated time) duration Δt. In each time step the DQN agent of the k-th slice observes the state s(k) in the training environment and selects an action a(k). Action selection is based on an ε-greedy policy that, with probability 1−ε, chooses the action that maximizes the output of the evaluation NN and, with probability ε, chooses a random action.
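A minimal sketch of this ε-greedy selection, assuming the agent structure of the previous snippet, is given below; it is not the authors' implementation.

```python
import random

import torch

def select_action(agent, state, epsilon, num_actions):
    """ε-greedy selection of a(k) for one time step."""
    if random.random() < epsilon:
        # With probability ε: explore with a random action.
        return random.randrange(num_actions)
    # With probability 1−ε: exploit the action maximizing the evaluation NN output.
    with torch.no_grad():
        q_values = agent.eval_nn(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())
```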
As a result of applying the selected action, the training environment generates a reward value r(k) that assesses how good the action was from the perspective of the desired behavior. In particular, in the considered approach the reward captures both the SLA satisfaction and the capacity utilization. In this way, the reward for slice k is defined as the weighted product of three terms given by:
r(k) = γSLA(k)^φ1 · ( (1/(K−1)) ∑_{k'=1, k'≠k}^{K} γSLA(k') )^φ2 · γu(k)^φ3        (1)

where φ1, φ2 and φ3 are the weights of each component.

The first and second components in (1) correspond, respectively, to the SLA satisfaction ratio γSLA(k) of the slice k and the aggregate for the rest of slices k'≠k. Specifically, γSLA(k) is the ratio between the aggregate throughput obtained by the slice across all cells, T(k), and the minimum between the aggregate offered load A(k) and the dlThptPerSlice(k) term of the SLA, and is computed as:

γSLA(k) = min( T(k) / min(dlThptPerSlice(k), A(k)), 1 )        (2)

where A(k) is the aggregate across all the cells of the per-cell offered load O(k,n) of slice k, bounded by the limit established by the TermDensity(k) and dlThptPerUe(k) parameters of the SLA in the service area S(n) of each cell n, that is:

A(k) = ∑_{n=1}^{N} min( O(k,n), dlThptPerUe(k) · TermDensity(k) · S(n) )        (3)

The third component of the reward is the capacity utilization factor, γu(k), which aims at minimizing the over-provisioning of capacity and is defined as the ratio between the aggregate throughput T(k) obtained by the slice and the total capacity allocated to the slice across all cells, that is:

γu(k) = T(k) / ∑_{n=1}^{N} C(n) · σ(k,n)        (4)

where C(n) is the capacity of cell n and σ(k,n) is the resource usage quota allocated to slice k in cell n.
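Putting (1)-(4) together, the per-slice reward could be evaluated as sketched below; the array shapes, the default weights φ1 = φ2 = φ3 = 1 and the quota notation σ(k,n) follow the reconstruction above and are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def slice_reward(k, T, offered_load, dl_thpt_per_slice, dl_thpt_per_ue,
                 term_density, cell_area, cell_capacity, quota,
                 phi1=1.0, phi2=1.0, phi3=1.0):
    """Evaluate the reward r(k) of (1) from the quantities in (2)-(4).

    T:             aggregate throughput T(k') per slice, shape (K,).
    offered_load:  per-cell offered load O(k', n), shape (K, N).
    cell_capacity: C(n) per cell, shape (N,).
    quota:         capacity quota sigma(k', n) per slice and cell, shape (K, N).
    The remaining SLA parameters are per-slice arrays of shape (K,).
    Assumes non-zero offered loads and allocations (no division-by-zero guard).
    """
    K, _ = offered_load.shape

    def gamma_sla(j):
        # (3): per-cell offered load bounded by the terminal-density limit.
        bound = dl_thpt_per_ue[j] * term_density[j] * cell_area
        A_j = np.sum(np.minimum(offered_load[j], bound))
        # (2): SLA satisfaction ratio, capped at 1.
        return min(T[j] / min(dl_thpt_per_slice[j], A_j), 1.0)

    own = gamma_sla(k)
    others = sum(gamma_sla(j) for j in range(K) if j != k) / (K - 1)
    gamma_u = T[k] / np.sum(cell_capacity * quota[k])      # (4)
    return (own ** phi1) * (others ** phi2) * (gamma_u ** phi3)   # (1)
```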
The reward r(k) is provided by the training environment to the DQN agent at the end of each time step and, correspondingly, the T(k) and A(k) values correspond to average values during the time step.

As a result of the interactions between the training environment and the DQN agent, each experience of the ED is represented by a tuple that includes the state observed at the beginning of a given time step, the selected action, the obtained reward as a result of this action and the new state observed at the end of the time step duration.
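Each such experience maps naturally onto a small record; the field names below are illustrative, assuming the agent and ED structure sketched earlier.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One entry of the experience data set (ED): (s, a, r, s')."""
    state: tuple        # s(k) observed at the beginning of the time step
    action: int         # a(k) selected by the ε-greedy policy
    reward: float       # r(k) returned by the training environment
    next_state: tuple   # state observed at the end of the time step

def store_experience(agent, state, action, reward, next_state):
    """Append the experience of the current time step to the ED."""
    agent.ed.append(Experience(state, action, reward, next_state))
```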
                                                               NN k according to the mini-batch gradient descent
