



rate at which the video needs to be encoded at the server. The client also estimates the percentage of packet loss. Based on these three parameters, the video quality is estimated using the ITU-T G.1070 parametric model. This estimate is used for the reward calculation in the proposed algorithm, the estimated throughput is assigned as the current state, and the video parameters are set to guide the current action. With the help of the state, action, and reward, the SBQA algorithm determines the future action. This action is sent as feedback to the server periodically.

[Fig. 1. Architecture of the proposed work: the server (Media Presentation Descriptor (MPD), transcoded video, video segments, quality adaptation based on feedback) and the client (decode and play video, SARSA-based quality adaptation algorithm, current state, reward estimation, new state and action identification, feedback dispatch) exchange the MPD request/response and the video stream over HTTP.]

3. ELEMENTS OF PROPOSED WORK

The SARSA-based algorithm implemented at the client forms the major part of the proposed algorithm. The different elements of the SARSA approach used in the quality and reward calculation represent the learning and adaptation process. A no-reference (NR) video quality metric is used as the reward function to guide the corrective actions.

3.1. Elements of SARSA Approach

a) State: It contains pertinent data about the environment conditions at a given time instance. In particular, the proposed model characterizes the state vector as S_cur = {Th}, where Th is the estimated throughput. The throughput values are mapped to different discrete, finite states.
b) Action: Qualities are defined on the basis of the analyzed data segment, and the quality segments are mapped to actions.
c) Reward Function: This function evaluates the fitness of the choice. The quality measurements obtained through NR video metrics provide the main input to the reward calculation.
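The mapping from the measured throughput to a discrete state, and from an NR quality estimate to the immediate reward, can be illustrated with the minimal sketch below; the bin edges, the 1-5 quality scale, and the reward normalization are illustrative assumptions rather than values taken from the proposed model.

```python
# Illustrative sketch of the state, action and reward elements (a-c).
# The throughput bin edges and the 1-5 quality scale are assumptions;
# the paper estimates quality with the ITU-T G.1070 parametric model.

THROUGHPUT_BINS_KBPS = [500, 1000, 2000, 4000]       # assumed bin edges
QUALITY_ACTIONS = ["low", "medium", "high", "full"]   # assumed segment qualities

def throughput_to_state(throughput_kbps):
    """Map the estimated throughput onto a discrete, finite state index."""
    for state, edge in enumerate(THROUGHPUT_BINS_KBPS):
        if throughput_kbps < edge:
            return state
    return len(THROUGHPUT_BINS_KBPS)                  # highest-throughput state

def quality_to_reward(estimated_quality):
    """Turn a no-reference quality estimate (assumed 1-5 scale) into the
    immediate reward v_q used later in the SARSA update."""
    return (estimated_quality - 1.0) / 4.0            # normalized to [0, 1]
```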
d) Q-Table, Q(S, A): The rows of this matrix represent the states of the system, and each column contains one of the possible actions (the segment qualities). For a given pair (s, a), Q(·) indicates the learned benefit that the system will get by taking action a in state s. In order to formulate the client's learning and corresponding action procedure, the SARSA approach updates the Q-matrix after each quality decision as follows:

   Q(s_cur, a_cur) ← Q(s_cur, a_cur) + α [v_q + γ Q(s_new, a_new) − Q(s_cur, a_cur)]        (1)

   where s_cur is the current state, a_cur is the selected action, v_q is the associated immediate reward, s_new is the next state after action a_cur, and a_new is the action chosen from state s_new. The learning rate (α) indicates how much the acquired information affects the old value of Q(·) in its update, and the discount factor (γ) weighs the contribution of the immediate and future rewards (0 ≤ γ ≤ 1).
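A minimal sketch of the Q-table and the update of Eq. (1), reusing the state and action sets assumed above; the table dimensions, learning rate α, and discount factor γ are illustrative values, not parameters reported for SBQA.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 4    # assumed: 5 throughput states, 4 quality actions
ALPHA, GAMMA = 0.1, 0.9       # assumed learning rate and discount factor

# Q-table: rows are states, columns are quality actions, initialised to zero.
Q = np.zeros((N_STATES, N_ACTIONS))

def sarsa_update(s_cur, a_cur, v_q, s_new, a_new):
    """Apply Eq. (1) once the reward v_q for the last quality decision is known."""
    td_target = v_q + GAMMA * Q[s_new, a_new]
    Q[s_cur, a_cur] += ALPHA * (td_target - Q[s_cur, a_cur])
```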
e) Exploration Policy: Two exploration policies are taken into consideration here, namely Softmax and ε-Greedy. The Softmax policy chooses an action by converting the action's expected reward into a probability. The action is then chosen according to the resulting distribution, which is the Boltzmann distribution given by

   P(a) = exp(Q(s, a)/τ) / Σ_{b ∈ A} exp(Q(s, b)/τ)        (2)

   where τ is a positive parameter called the temperature and A is the set of possible actions of the system. High temperatures cause all actions to be nearly equiprobable, whereas low temperatures cause greedy action selection.
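The Boltzmann selection of Eq. (2) can be sketched as follows; the temperature value is an assumption, and the max-subtraction is only a numerical-stability detail.

```python
import numpy as np

def softmax_action(Q, s, tau=0.5):
    """Boltzmann (Softmax) exploration, Eq. (2): sample an action with
    probability proportional to exp(Q(s, a) / tau). tau = 0.5 is assumed."""
    prefs = Q[s] / tau
    prefs = prefs - prefs.max()                     # avoid overflow in exp()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(probs), p=probs))
```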
With ε-greedy, the agent selects a random action at each time step with a fixed probability 0 < ε < 1, instead of greedily selecting one of the learned optimal actions with respect to the Q-function:

   a = { a random action from A(s),   if r < ε
         argmax_a Q(s, a),            otherwise }        (3)

where 0 < r < 1 is a uniform random number drawn at each time step.
The SBQA approach is differentiated into two methods based on the two exploration policies: SBQA using the Softmax Policy (SBQA-SP) and SBQA using the ε-Greedy Policy (SBQA-GP).
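A matching sketch for the ε-Greedy rule of Eq. (3); read this way, SBQA-SP and SBQA-GP share the same SARSA update and differ only in which selection function is plugged into the adaptation loop. The value of ε is an assumption.

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """ε-Greedy exploration, Eq. (3): with probability ε take a random quality
    action, otherwise the greedy one. epsilon = 0.1 is an assumed value."""
    if np.random.random() < epsilon:      # r < ε : explore
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))           # otherwise: exploit argmax_a Q(s, a)

# SBQA-SP would call softmax_action(...) and SBQA-GP epsilon_greedy_action(...)
# before each segment request; the chosen index selects the segment quality.
```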




