



rate at which the video needs to be encoded at the server. The client also estimates the percentage of packet loss. Based on these three parameters, the video quality is estimated using the ITU-T G.1070 parametric model. This estimate is used for the reward calculation in the proposed algorithm, the estimated throughput is assigned as the current state, and the video parameters are set to guide the current action. With the help of the state, action, and reward, the SBQA algorithm determines the future action. This action is sent as feedback to the server periodically.

[Fig. 1. Architecture of the proposed work: the server (Media Presentation Descriptor (MPD), transcoded video, video segments, quality adaptation based on feedback) and the client (decode and play video, SARSA-based quality adaptation algorithm, current state, reward estimation, new state and action identification, feedback dispatch) exchange the MPD request/response and the video stream over HTTP.]

3. ELEMENTS OF PROPOSED WORK

The SARSA-based algorithm implemented at the client forms the major part of the proposed algorithm. The different elements of the SARSA approach used in the quality and reward calculation represent the learning and adaptation process. A no-reference (NR) video quality metric is used as the reward function to guide the corrective actions.

3.1. Elements of SARSA Approach

a) State: It contains pertinent data about the environment conditions at a given time instance. In particular, the proposed model characterizes the state vector as S_cur = {Th}, where Th is the estimated throughput. The throughput values are mapped to different discrete, finite states.
b) Action: Qualities are defined on the basis of the analyzed data segment, and the quality segments are mapped to actions.
c) Reward Function: This function evaluates the fitness of the choice. The quality measurements obtained through NR video metrics provide the main input to the reward calculation.
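The mapping from the measured throughput to a discrete state, and from an NR quality estimate to the immediate reward, can be illustrated with the minimal sketch below; the bin edges, the 1-5 quality scale, and the reward normalization are illustrative assumptions rather than values taken from the proposed model.

```python
# Illustrative sketch of the state, action and reward elements (a-c).
# The throughput bin edges and the 1-5 quality scale are assumptions;
# the paper estimates quality with the ITU-T G.1070 parametric model.

THROUGHPUT_BINS_KBPS = [500, 1000, 2000, 4000]       # assumed bin edges
QUALITY_ACTIONS = ["low", "medium", "high", "full"]   # assumed segment qualities

def throughput_to_state(throughput_kbps):
    """Map the estimated throughput onto a discrete, finite state index."""
    for state, edge in enumerate(THROUGHPUT_BINS_KBPS):
        if throughput_kbps < edge:
            return state
    return len(THROUGHPUT_BINS_KBPS)                  # highest-throughput state

def quality_to_reward(estimated_quality):
    """Turn a no-reference quality estimate (assumed 1-5 scale) into the
    immediate reward v_q used later in the SARSA update."""
    return (estimated_quality - 1.0) / 4.0            # normalized to [0, 1]
```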
d) Q-Table, Q(S, A): The rows of this matrix represent the states of the system, and each column contains one of the possible actions (the segment qualities). For a given pair (s, a), Q(·) indicates the learned benefit that the system will get by taking action a in state s. In order to formulate the client's learning and corresponding action procedure, the SARSA approach updates the Q-matrix after each quality decision as follows:

   Q(s_cur, a_cur) ← Q(s_cur, a_cur) + α [v_q + γ Q(s_new, a_new) − Q(s_cur, a_cur)]        (1)

   where s_cur is the current state, a_cur is the selected action, v_q is the associated immediate reward, s_new is the next state after action a_cur, and a_new is the action chosen from state s_new. The learning rate (α) indicates how much the acquired information affects the old value of Q(·) in its update, and the discount factor (γ) weighs the contribution of the immediate and future rewards (0 ≤ γ ≤ 1).
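A minimal sketch of the Q-table and the update of Eq. (1), reusing the state and action sets assumed above; the table dimensions, learning rate α, and discount factor γ are illustrative values, not parameters reported for SBQA.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 4    # assumed: 5 throughput states, 4 quality actions
ALPHA, GAMMA = 0.1, 0.9       # assumed learning rate and discount factor

# Q-table: rows are states, columns are quality actions, initialised to zero.
Q = np.zeros((N_STATES, N_ACTIONS))

def sarsa_update(s_cur, a_cur, v_q, s_new, a_new):
    """Apply Eq. (1) once the reward v_q for the last quality decision is known."""
    td_target = v_q + GAMMA * Q[s_new, a_new]
    Q[s_cur, a_cur] += ALPHA * (td_target - Q[s_cur, a_cur])
```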
e) Exploration Policy: Two exploration policies are taken into consideration here, namely Softmax and ε-Greedy. The Softmax policy chooses an action by converting the action's expected reward into a probability. The action is then chosen according to the resulting distribution, which is the Boltzmann distribution given by

   P(a) = exp(Q(s, a)/τ) / Σ_{b ∈ A} exp(Q(s, b)/τ)        (2)

   where τ is a positive parameter called the temperature and A is the set of possible actions of the system. High temperatures cause all actions to be nearly equiprobable, whereas low temperatures cause greedy action selection.
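The Boltzmann selection of Eq. (2) can be sketched as follows; the temperature value is an assumption, and the max-subtraction is only a numerical-stability detail.

```python
import numpy as np

def softmax_action(Q, s, tau=0.5):
    """Boltzmann (Softmax) exploration, Eq. (2): sample an action with
    probability proportional to exp(Q(s, a) / tau). tau = 0.5 is assumed."""
    prefs = Q[s] / tau
    prefs = prefs - prefs.max()                     # avoid overflow in exp()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(probs), p=probs))
```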
With ε-greedy, the agent selects a random action at each time step with a fixed probability 0 < ε < 1, instead of greedily selecting one of the learned optimal actions with respect to the Q-function:

   a = { a random action from A(s),   if r < ε
         argmax_a Q(s, a),            otherwise }        (3)

where 0 < r < 1 is a uniform random number drawn at each time step.
The SBQA approach is differentiated into two methods based on the two exploration policies: SBQA using the Softmax Policy (SBQA-SP) and SBQA using the ε-Greedy Policy (SBQA-GP).
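A matching sketch for the ε-Greedy rule of Eq. (3); read this way, SBQA-SP and SBQA-GP share the same SARSA update and differ only in which selection function is plugged into the adaptation loop. The value of ε is an assumption.

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """ε-Greedy exploration, Eq. (3): with probability ε take a random quality
    action, otherwise the greedy one. epsilon = 0.1 is an assumed value."""
    if np.random.random() < epsilon:      # r < ε : explore
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))           # otherwise: exploit argmax_a Q(s, a)

# SBQA-SP would call softmax_action(...) and SBQA-GP epsilon_greedy_action(...)
# before each segment request; the chosen index selects the segment quality.
```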




