



analysis it adapts the video quality parameters (resolution, frame rate).

2.3  Client Side Functions

The client initialises the media player component with the media URL, the appropriate HTTP port and the decoding parameters needed to receive the streaming data. It also establishes a TCP/IP connection with the server. The video track information is examined to obtain the parameters of the video being streamed, which in turn are used to identify the current action of the system. The client captures the packets and estimates the throughput. The reward is calculated as described in Mode 2 of the Pv module of the ITU-T P.1203.1 video quality estimation model. The estimated throughput is used to identify the current state. With the obtained state, action and reward, the proposed algorithm decides the future action to be taken by the server and sends this action as feedback to the server. The feedback is sent at regular intervals.
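A minimal sketch of this client-side feedback loop is given below. It uses hypothetical stand-ins (estimate_throughput, p1203_reward, decide_next_action) for the packet-capture based throughput estimation, the Mode 2 Pv quality score and the proposed adaptation step, together with an example host, port and feedback interval; it illustrates the control flow only, not the actual implementation.

    import socket
    import time

    # Stand-ins for components described in the text; the real client uses the
    # media player, the captured packets and the ITU-T P.1203.1 Mode 2 Pv module.
    def estimate_throughput():
        return 2.5                          # Mbit/s (dummy value)

    def p1203_reward():
        return 4.2                          # quality score (dummy value)

    def decide_next_action(state, action, reward):
        return (action + 1) % 5             # placeholder for the Double Sarsa step

    def client_feedback_loop(host="127.0.0.1", port=8080, interval=1.0):
        """Estimate throughput, compute the reward, decide the next quality
        action and send it back to the server over the TCP/IP connection."""
        with socket.create_connection((host, port)) as sock:
            action = 0
            while True:
                throughput = estimate_throughput()
                state = int(throughput)                  # crude throughput-to-state mapping
                reward = p1203_reward()
                action = decide_next_action(state, action, reward)
                sock.sendall(f"{action}\n".encode())     # feedback at a regular interval
                time.sleep(interval)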



3.  COMPONENTS OF THE PROPOSED WORK
3.1  State-Action-Reward-State-Action (Sarsa)

The state is the current situation returned by the environment and contains data about the environment at a given time instance. Here the state is characterised as s_cur = {th_1, th_2, ..., th_n}, where th_n is the n-th estimated throughput; the throughput values are thus mapped to different discrete states. An action represents the moves the agent can take; here the actions represent the different quality adaptation processes, so the various possible quality objectives are mapped to the actions. A reward is the immediate return sent back from the environment to evaluate the last action. The input for the reward calculation is the output of the NR video quality assessment metrics.
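As a concrete illustration of this mapping, the sketch below bins the estimated throughput into discrete states and indexes a set of quality objectives as actions; the bin edges and the action set are assumed for the example and are not taken from the paper.

    # Assumed throughput bin edges (Mbit/s) and assumed quality objectives.
    STATE_THRESHOLDS = [1.0, 2.5, 5.0, 8.0]
    ACTIONS = ["240p", "360p", "480p", "720p", "1080p"]

    def throughput_to_state(throughput_mbps):
        """Map an estimated throughput th_i to the index of a discrete state."""
        for i, edge in enumerate(STATE_THRESHOLDS):
            if throughput_mbps < edge:
                return i
        return len(STATE_THRESHOLDS)          # highest-throughput state

    print(throughput_to_state(3.2))           # -> 2 (between 2.5 and 5.0 Mbit/s)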
Sarsa is basically an on-policy learning method, where the system interacts with the environment in order to update its policy based on the actions taken. In the Q-table, the rows and columns of the matrix are the states of the system and the possible actions, respectively. For any given pair (s, a), the Q-value represents the learned value that the system will acquire by taking action a in state s, formulated as

Q(s,a) ← Q(s,a) + α[r + γQ(s′,a′) − Q(s,a)]                (1)

where α represents the learning rate, γ the discount factor, and Q(s′,a′) the Q-value resulting from the new action a′ in state s′.
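The following sketch applies the tabular update of equation (1); the state/action space sizes and the values of α and γ are illustrative choices, not those used in the paper.

    import numpy as np

    N_STATES, N_ACTIONS = 5, 5
    alpha, gamma = 0.1, 0.9                   # learning rate and discount factor (assumed)
    Q = np.zeros((N_STATES, N_ACTIONS))       # rows: states, columns: actions

    def sarsa_update(s, a, r, s_next, a_next):
        """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))"""
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    sarsa_update(s=2, a=1, r=4.2, s_next=3, a_next=2)
    print(Q[2, 1])                            # 0.42 on an initially zero table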
3.2  Double Sarsa

In the Double Sarsa method [15], two action-value estimates Q_A(s,a) and Q_B(s,a) are defined to improve the performance of Sarsa in stochastic scenarios. The update rule for Double Sarsa is thus given as

Q_A(s,a) ← Q_A(s,a) + α[r + γQ_B(s′,a′) − Q_A(s,a)]                (2)

The learning rate (α) determines how much the newly acquired knowledge influences the old value during an update. It is set between 0 and 1: setting it to 0 means that the Q-value is never updated, while setting a high value means that learning occurs quickly. The discount factor (γ), which is also set between 0 and 1, models the fact that future rewards are worth less than immediate rewards.

Double Sarsa works similarly to double Q-learning, but since it is an on-policy method, exploration policies such as Softmax and ε-greedy need to be modified.
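A minimal sketch of this update is shown below. Equation (2) gives the rule for Q_A; the sketch assumes the symmetric rule for Q_B and that the table to update is chosen uniformly at random on each step, which is a common choice but is not stated explicitly in the text.

    import random
    import numpy as np

    N_STATES, N_ACTIONS = 5, 5
    alpha, gamma = 0.1, 0.9                   # illustrative values
    Q_A = np.zeros((N_STATES, N_ACTIONS))
    Q_B = np.zeros((N_STATES, N_ACTIONS))

    def double_sarsa_update(s, a, r, s_next, a_next):
        """Update one table using the other table's estimate of Q(s',a')."""
        if random.random() < 0.5:
            Q_A[s, a] += alpha * (r + gamma * Q_B[s_next, a_next] - Q_A[s, a])
        else:
            Q_B[s, a] += alpha * (r + gamma * Q_A[s_next, a_next] - Q_B[s, a])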
3.3  Exploration policy

Two exploration policies are taken into consideration here, namely Softmax and ε-greedy. Softmax utilizes action-selection probabilities that are determined by ranking the value-function estimates using a Boltzmann distribution, given by

π(a|s) = exp([Q_A(s,a) + Q_B(s,a)] / 2τ) / Σ_b exp([Q_A(s,b) + Q_B(s,b)] / 2τ)                (3)

where τ is a positive parameter called the temperature and b ranges over all possible actions. High temperatures cause all actions to be nearly equiprobable, whereas low temperatures cause greedy action selection.

In the ε-greedy policy, a random action is chosen with uniform distribution from the given set of actions. This policy selects a random action with probability ε (0 < ε < 1), or, with probability 1 − ε, the action that maximizes the reward in the given state. The ε-greedy policy that uses the average of the two tables to determine the greedy action is as follows:

π(a|s) = 1 − ε,      if a = argmax_{a′∈A(s)} [Q_A(s,a′) + Q_B(s,a′)] / 2
       = ε / N_a,    otherwise                                                 (4)

where π(a|s) is the probability of taking action a from state s, and N_a is the number of actions that can be taken from state s. The Double Sarsa based adaptation algorithm is thus used with these two exploration policies.
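The two policies can be sketched as follows, operating on the averaged tables as in equations (3) and (4); the temperature and ε values are example choices, and the uniform draw in the ε-greedy branch may also return the greedy action, which is the usual way equation (4) is implemented.

    import numpy as np

    def softmax_policy(Q_A, Q_B, s, tau=0.5):
        """Action probabilities of equation (3) from the averaged tables."""
        prefs = (Q_A[s] + Q_B[s]) / (2.0 * tau)
        prefs -= prefs.max()                      # subtract max for numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    def epsilon_greedy_action(Q_A, Q_B, s, eps=0.1, rng=None):
        """Action selection of equation (4): explore with probability eps,
        otherwise take the greedy action of the averaged tables."""
        rng = rng or np.random.default_rng()
        if rng.random() < eps:
            return int(rng.integers(Q_A.shape[1]))    # uniform random action
        return int(np.argmax(Q_A[s] + Q_B[s]))        # same argmax as the average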
3.4  Video Quality Estimation using No-Reference Metrics

ITU-T P.1203.1 defines a set of objective parametric quality assessment modules [16]. Although the P.1203.1 recommendation describes four different quality modes (0, 1, 2 and 3), mode 2 is used here as it deals with no encryption and medium complexity. The following parameters are used in the description of the model:

Quant (quant ∈ [0, 1]): a parameter representing the degradation due to quantization.
Scale Factor (scaleFactor ∈ [0, 1]): a parameter used to capture the upscaling degradation.
FrameRate (framerate): the video rate in frames per second.