analysis it adapts the video quality parameters (resolution, frame rate).

2.3 Client Side Functions

The client initialises the media player component with the media URL, the appropriate HTTP port and the decoding parameters needed to receive the streaming data. It also establishes a TCP/IP connection with the server. The video track information is examined to obtain the parameters of the video being streamed, which in turn are used to identify the current action of the system. The client captures the packets and estimates the throughput. The reward is calculated as described in Mode 2 of the Pv module of the ITU-T P.1203.1 video quality estimation model. The estimated throughput is used to identify the current state. With the obtained state, action and reward, the proposed algorithm decides the future action to be taken by the server and then sends this action as feedback to the server. The feedback is sent at regular intervals.
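As an illustration of this loop, the following minimal Python sketch ties the steps together; the helper functions, throughput thresholds and feedback interval are assumptions made for illustration and are not taken from the paper.

import random
import time

THROUGHPUT_BINS = [0.5, 1.0, 2.0, 4.0, 8.0]      # Mbit/s state boundaries (assumed)

def estimate_throughput():
    """Placeholder for throughput estimation from the captured packets."""
    return random.uniform(0.2, 10.0)

def p1203_mode2_reward():
    """Placeholder for the Mode 2 Pv score of ITU-T P.1203.1 (MOS-like value)."""
    return random.uniform(1.0, 5.0)

def map_to_state(throughput):
    """Map the estimated throughput onto a discrete state index."""
    return sum(throughput > b for b in THROUGHPUT_BINS)

def client_loop(select_action, send_feedback, interval=1.0):
    """Estimate throughput, compute the reward, choose an action and report it."""
    while True:
        state = map_to_state(estimate_throughput())
        reward = p1203_mode2_reward()
        action = select_action(state, reward)     # Double Sarsa based decision
        send_feedback(action)                     # feedback message to the server
        time.sleep(interval)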
3. COMPONENTS OF THE PROPOSED WORK

3.1 State-Action-Reward-State-Action (Sarsa)

State is the current situation returned by the environment and it contains data regarding the environment at a given time instance. Here the state is characterised as s_cur = {th_1, th_2, ..., th_n}, where th_n is the n-th estimated throughput; thus the throughput values are mapped to different discrete states. An action represents the moves the agent can take, and here the actions represent the different quality adaptation processes, so the various possible quality objectives are mapped to the actions. A reward is the immediate return sent back from the environment to evaluate the last action. The input for the reward calculation is the output of the NR video quality assessment metrics.

Sarsa is an on-policy learning method in which the system interacts with the environment and updates its policy based on the actions taken. In the Q-table, the rows and columns of the matrix are the states of the system and the possible actions, respectively. For any given pair (s, a), the Q-value represents the learned value that the system will acquire by taking action a in state s, and it is updated as

Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)]    (1)

where α represents the learning rate, γ the discount factor, and Q(s',a') the Q-value resulting from the new action a' in state s'.
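A minimal Python sketch of the tabular Sarsa update in equation (1) is given below; the state and action space sizes and the parameter values are illustrative, not those used in the paper.

import numpy as np

N_STATES, N_ACTIONS = 6, 5           # illustrative sizes of the Q-table
ALPHA, GAMMA = 0.1, 0.9              # learning rate and discount factor (assumed)

Q = np.zeros((N_STATES, N_ACTIONS))  # rows: states, columns: actions

def sarsa_update(s, a, r, s_next, a_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next, a_next] - Q[s, a])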
3.2 Double Sarsa

In the Double Sarsa method [15], two action-value estimates Q_A(s,a) and Q_B(s,a) are maintained to improve the performance of Sarsa in stochastic scenarios. The update rule for Double Sarsa is given as follows:

Q_A(s,a) ← Q_A(s,a) + α[r + γQ_B(s',a') − Q_A(s,a)]    (2)

The learning rate (α) determines how much the newly acquired knowledge influences the old value during an update. It is set between 0 and 1; setting it to 0 means that the Q-value will never be updated, while setting a high value means that learning can occur quickly. The discount factor (γ), which is also set between 0 and 1, models the fact that future rewards are worth less than immediate rewards.

Double Sarsa works similarly to double Q-learning, but since it is an on-policy method, exploration policies such as Softmax and ε-greedy need to be modified.
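The sketch below illustrates a Double Sarsa update built on equation (2); updating a randomly chosen table using the other table's estimate follows the usual double-estimator scheme of [15] and is an assumption here, as are the parameter values.

import numpy as np

N_STATES, N_ACTIONS = 6, 5
ALPHA, GAMMA = 0.1, 0.9
Q_A = np.zeros((N_STATES, N_ACTIONS))
Q_B = np.zeros((N_STATES, N_ACTIONS))

def double_sarsa_update(s, a, r, s_next, a_next):
    """Update one table towards the other's estimate of (s', a'), as in equation (2)."""
    if np.random.rand() < 0.5:
        Q_A[s, a] += ALPHA * (r + GAMMA * Q_B[s_next, a_next] - Q_A[s, a])
    else:
        Q_B[s, a] += ALPHA * (r + GAMMA * Q_A[s_next, a_next] - Q_B[s, a])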
3.3 Exploration policy

Two exploration policies are considered here, namely Softmax and ε-greedy. Softmax utilizes action-selection probabilities which are determined by ranking the value-function estimates using a Boltzmann distribution given by

π(a|s) = exp([Q_A(s,a) + Q_B(s,a)] / 2τ) / Σ_b exp([Q_A(s,b) + Q_B(s,b)] / 2τ)    (3)

where τ is a positive parameter called the temperature and b ranges over all possible actions. High temperatures cause all actions to be nearly equiprobable, whereas low temperatures cause greedy action selections.
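A possible implementation of the Softmax selection of equation (3) over the average of the two tables is sketched below; the temperature value is illustrative.

import numpy as np

def softmax_action(Q_A, Q_B, s, tau=0.5):
    """Sample an action from the Boltzmann distribution of equation (3)."""
    avg = (Q_A[s] + Q_B[s]) / 2.0                 # average of the two tables
    prefs = np.exp((avg - avg.max()) / tau)       # subtract max for numerical stability
    probs = prefs / prefs.sum()                   # pi(a|s) for every action a
    return int(np.random.choice(len(probs), p=probs))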
In the ε-greedy policy, a random action is chosen with uniform distribution from the given set of actions with probability ε (0 < ε < 1), while with probability 1 − ε the action that maximizes the reward in the given state is chosen. The ε-greedy policy that uses the average of the two tables to determine the greedy action is as follows:

π(a|s) = 1 − ε,    if a = argmax_a' [Q_A(s,a') + Q_B(s,a')] / 2
         ε / N_a,  otherwise                                        (4)

where π(a|s) is the probability of taking action a from state s, and N_a is the number of actions that can be taken from state s. The Double Sarsa based adaptation algorithm is thus used with these two exploration policies.
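A standard implementation of ε-greedy selection consistent with equation (4) is sketched below; the value of ε is illustrative, and exploration is uniform over all actions.

import numpy as np

def epsilon_greedy_action(Q_A, Q_B, s, epsilon=0.1):
    """Explore uniformly with probability epsilon, otherwise act greedily on the averaged tables."""
    n_actions = Q_A.shape[1]                      # N_a, the number of available actions
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))  # random exploratory action
    avg = (Q_A[s] + Q_B[s]) / 2.0                 # average of the two tables
    return int(np.argmax(avg))                    # greedy action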
3.4 Video Quality Estimation using No-Reference Metrics

ITU-T P.1203.1 defines a set of objective parametric quality assessment modules [16]. Although the P.1203.1 recommendation describes four different quality modes (0, 1, 2 and 3), mode 2 is used here as it deals with no encryption at medium complexity. The following parameters are used in the description of the model:
Quant (quant ∈ [0, 1]): a parameter representing the degradation due to quantization.
Scale Factor (scaleFactor ∈ [0, 1]): a parameter used to capture the upscaling degradation.
FrameRate (framerate): the video rate in frames per second.