Page 120 - Proceedings of the 2018 ITU Kaleidoscope

2018 ITU Kaleidoscope Academic Conference




mapping between state-action pairs and real numbers, that is, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.

After receiving a reward, which can be positive or negative, IoT nodes move from one state to another, again depending on the previous state-action pair. These transitions can be stochastic, so that RL entities transition probabilistically from one state to another. Formally, $\wp$ models the transition from one state to another by mapping a tuple $(s, a, s')$ to a real number representing the probability of transitioning from state $s$ to state $s'$ after taking action $a$. That is, $\wp: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$.

The goal of RL algorithms is to find an optimal action policy ($\pi^*$) that maximizes the expected total reward accumulated over some finite or infinite horizon. In the former case, the reward is aggregated over a fixed number of time units (e.g. seconds), whereas in the latter it is the average reward per unit of time that is maximized. Therefore, the objective of the policy is, when the IoT node is in a certain state $s$, to propose an action $a^*$ such that the total expected reward is maximized. Following the above formulation, an action policy $\pi$ can be mathematically represented as a mapping between states and actions, that is, $\pi: \mathcal{S} \to \mathcal{A}$. This optimal policy can be implemented either as a tabular solution (i.e. for each state, a table stores the optimal action to take) or approximated by a function (i.e. a function takes states as inputs and returns actions as outputs). When the process under optimization is relatively complex, the number of potential different states (the size of the state space $\mathcal{S}$) is too large to be tabulated. Since function approximators are then the only feasible alternative, and given the recent successes of Artificial Neural Networks (ANN) in approximating functions, a plethora of ANN-based algorithms have recently emerged in the RL field. The basic idea is to have an ANN that, when fed with the current state of the RL entity (the IoT node in our case), returns the most promising action to follow.

Among all ANN-based alternatives, Evolution Strategies (ES) [21] has recently proven to be one of the best-suited alternatives for deriving optimal policies, especially when the effects of the actions are long-lasting (that is, taking the action $a$ at a given instant $t$ has a measurable, non-negligible effect at time $t'$, with $t' \gg t$). ES is a type of Genetic Algorithm [22], a black-box optimization meta-heuristic loosely inspired by natural selection. By iteratively tweaking the parameters of the ANN via natural selection, the modeled policy $\pi$ tends to improve in proposing actions to take.

5. APPLICATION TO THE PROBLEM

Let an IoT network monitor a set of critical assets with some parameters of interest. Such a network is, in turn, composed of IoT nodes provided with different RATs that can be used to report certain detected events. Thus, having detected an event, an IoT node must decide whether to report it or not. If it chooses to report it, it also has to determine which RAT to use. Thus, the set of all allowed actions $\mathcal{A}$ is composed of $\{a_0, a_1, \ldots, a_N\}$, where $N$ is the number of allowed RATs to use, and $a_0$ models the action of not reporting but dropping the event. Each of these generated events can have a different priority. For instance, if a malfunction in a life-supporting device is detected, a top-priority event must be sent. Conversely, if a regular event is detected (such as mild vibrations in an engine), a low-priority event could be generated. Priorities are allowed to vary in a range from 0 to 1, where 1 means top priority.

To model the nature of wireless communications, each RAT may have a limit on its usage. This can be due to two different reasons: (i) a limit on the total expenditure allowed, e.g. per day, derived from using such a technology (for example, IoT nodes may not be allowed to spend more than \$1 a day on 5G transmissions); or (ii) a limit on the traffic generated by any given technology, which can be expressed in bytes (e.g. per day) or in packets (for example, Sigfox nodes cannot generate more than 140 packets a day [23], or nodes making use of cellular technologies may not generate more than 1 Mb of traffic a day). Therefore, when action $a_i$ (with $i \neq 0$) is taken, the state $s$ of the IoT node changes, since the usage of technology $i$ is also updated. When the usage of technology $i$ reaches its limit, that technology is no longer available that day. Without loss of generality, usage limits are considered over periods of 24 hours (1 day).

Furthermore, each action/RAT entails a different energy consumption. Since a single battery per node is assumed, if the battery level drops to zero, no further events can be reported. To complete the definition of the node state, the length of the generated packet (created as a response to an arising event) must be considered. It should be noted that the event-generation process is modeled as a Poisson process with an average rate of $\lambda$ events per second.

As noted in the Introduction, some LPWAN technologies, depending on the country, must undergo an enforced off-period after every transmission. To model this and, at the same time, packet buffering capabilities, individual infinite queues are assumed to exist for each RAT (that is, there exist $N$ different queues in each node). Therefore, the transmission time of a packet depends not only on the length of the packet but also on the occupation of the queues, which is tracked for each RAT. If an off-period is enforced in an IoT node as a result of a packet transmission, the LPWAN queue of that node is not only filled with such a packet, but also artificially extended with another fictitious packet whose transmission would last exactly the off-period. Note that this artificially generated packet has no impact on the obtained reward. Using this trick, we force nodes not to use the LPWAN RAT for at least the duration of the off-period, and thus to comply with regional regulations.

Finally, from the mathematical point of view, the state $s$ of a node is the vector formed by
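The formal objects introduced on this page (reward function, transition function, and tabular policy) can be made concrete with a toy example. The following is a minimal Python sketch, not the authors' implementation: the two states, the two actions, and every numeric value are hypothetical and chosen only to illustrate the finite-horizon objective.

```python
from typing import Dict, Tuple

State = str
Action = str

# Hypothetical two-state, two-action example (values are illustrative only).
STATES = ("idle", "alert")
ACTIONS = ("drop", "report")

# Reward function R: S x A -> R, stored as a table.
REWARD: Dict[Tuple[State, Action], float] = {
    ("idle", "drop"): 0.0,
    ("idle", "report"): -1.0,
    ("alert", "drop"): -5.0,
    ("alert", "report"): 2.0,
}

# Transition function: probability of reaching each next state
# from a (state, action) pair; every row sums to 1.
TRANSITION: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("idle", "drop"): {"idle": 0.9, "alert": 0.1},
    ("idle", "report"): {"idle": 0.9, "alert": 0.1},
    ("alert", "drop"): {"idle": 0.5, "alert": 0.5},
    ("alert", "report"): {"idle": 1.0, "alert": 0.0},
}


def expected_return(policy: Dict[State, Action], state: State, horizon: int) -> float:
    """Expected total reward over `horizon` steps when following `policy`."""
    if horizon == 0:
        return 0.0
    action = policy[state]
    future = sum(
        prob * expected_return(policy, nxt, horizon - 1)
        for nxt, prob in TRANSITION[(state, action)].items()
    )
    return REWARD[(state, action)] + future


def best_tabular_policy(horizon: int, start: State):
    """Enumerate every stationary tabular policy and keep the best one."""
    best, best_value = None, float("-inf")
    for a_idle in ACTIONS:
        for a_alert in ACTIONS:
            policy = {"idle": a_idle, "alert": a_alert}
            value = expected_return(policy, start, horizon)
            if value > best_value:
                best, best_value = policy, value
    return best, best_value
```

Exhaustive enumeration is only feasible because the table has two entries; with a large state space this is exactly the situation in which the text resorts to function approximators.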

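The idea of improving a parameterized policy through selection can be sketched as follows. This is a simplified elitist (1+λ) evolution strategy written for illustration; it is not necessarily the ES variant of [21], and the linear policy, the toy fitness function, the cost of 0.3, and all hyperparameters are assumptions.

```python
import random
from typing import Callable, List, Tuple


def act(weights: List[float], priority: float) -> int:
    """Tiny linear policy: score both actions and return the better one.

    weights = [slope_drop, bias_drop, slope_report, bias_report]
    Returns 0 (drop) or 1 (report).
    """
    score_drop = weights[0] * priority + weights[1]
    score_report = weights[2] * priority + weights[3]
    return 1 if score_report > score_drop else 0


def fitness(weights: List[float]) -> float:
    """Toy return: reporting earns the event priority minus a fixed cost."""
    total = 0.0
    for step in range(11):
        priority = step / 10
        if act(weights, priority) == 1:
            total += priority - 0.3  # hypothetical transmission cost
    return total


def evolve(
    fit: Callable[[List[float]], float],
    dim: int,
    generations: int = 50,
    offspring: int = 20,
    sigma: float = 0.1,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Perturb the parent with Gaussian noise and keep the best candidate."""
    rng = random.Random(seed)
    parent = [0.0] * dim
    parent_fit = fit(parent)
    for _ in range(generations):
        children = [
            [w + rng.gauss(0.0, sigma) for w in parent] for _ in range(offspring)
        ]
        best = max(children, key=fit)
        best_fit = fit(best)
        if best_fit >= parent_fit:  # elitism: never accept a worse policy
            parent, parent_fit = best, best_fit
    return parent, parent_fit
```

Note that the optimizer only ever calls `fit` on candidate parameters, which is what makes the approach black-box: no gradient of the reward with respect to the policy parameters is needed.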

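The action set, the per-RAT daily usage limits, and the single shared battery described above can be sketched as a small data model. This is an illustrative sketch under assumptions: the class and field names are hypothetical, usage is counted in generic units (one unit per transmission by default), and the limits used in the usage example are not the paper's values.

```python
from dataclasses import dataclass
from typing import List

DROP = 0  # action a_0: do not report, drop the event


@dataclass
class Rat:
    name: str
    daily_limit: float    # usage cap per 24 h (packets, bytes or cost)
    energy_per_tx: float  # battery energy consumed by one transmission
    usage: float = 0.0    # usage accumulated during the current day


@dataclass
class Node:
    rats: List[Rat]
    battery: float

    def allowed_actions(self) -> List[int]:
        """Return a_0 plus every RAT index whose daily limit is not reached."""
        if self.battery <= 0.0:
            return [DROP]  # an empty battery prevents any further report
        return [DROP] + [
            i for i, rat in enumerate(self.rats, start=1)
            if rat.usage < rat.daily_limit
        ]

    def take(self, action: int, cost: float = 1.0) -> None:
        """Apply action a_i: update the usage of RAT i and drain the battery."""
        if action == DROP:
            return
        if action not in self.allowed_actions():
            raise ValueError(f"action {action} is not available")
        rat = self.rats[action - 1]
        rat.usage += cost
        self.battery = max(0.0, self.battery - rat.energy_per_tx)

    def new_day(self) -> None:
        """Reset every usage counter at the start of a new 24-hour period."""
        for rat in self.rats:
            rat.usage = 0.0
```

For example, a node built as `Node(rats=[Rat("sigfox", 2, 1.0)], battery=10.0)` loses action 1 after two calls to `take(1)` and regains it after `new_day()`.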

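Event generation with a given average rate can be simulated by drawing exponentially distributed inter-arrival times, which is the standard way to sample a Poisson process. In the sketch below the function name and the uniform priority distribution are assumptions; the text only states that priorities lie between 0 and 1.

```python
import random
from typing import List, Tuple


def generate_events(rate: float, horizon: float, seed: int = 0) -> List[Tuple[float, float]]:
    """Return (time, priority) pairs for events occurring before `horizon`.

    `rate` is the average number of events per second.
    """
    rng = random.Random(seed)
    events = []
    t = rng.expovariate(rate)  # time of the first event
    while t < horizon:
        # Uniform priority in [0, 1): an assumption, not taken from the text.
        events.append((t, rng.random()))
        t += rng.expovariate(rate)  # exponential gap to the next event
    return events
```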

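The fictitious-packet trick used to model enforced off-periods can be sketched with a queue that tracks its backlog in seconds. This is a minimal illustration under assumptions: the class name, the fixed bit rate, and the representation of the queue as a single backlog counter are hypothetical simplifications.

```python
class RatQueue:
    """Per-RAT FIFO queue tracked as seconds of pending transmission work."""

    def __init__(self, rate_bps: float, off_period_s: float = 0.0) -> None:
        self.rate_bps = rate_bps          # assumed constant link rate
        self.off_period_s = off_period_s  # enforced silence after each packet
        self.backlog_s = 0.0              # current queue occupation in seconds

    def enqueue(self, length_bits: float) -> float:
        """Queue a packet and return the delay until it is fully sent.

        The delay depends on both the packet length and the current backlog.
        After the packet, a fictitious packet lasting exactly the off-period
        is appended, so later packets must wait out the off-period.
        """
        delay = self.backlog_s + length_bits / self.rate_bps
        self.backlog_s = delay + self.off_period_s
        return delay

    def advance(self, elapsed_s: float) -> None:
        """Let `elapsed_s` seconds pass, draining the backlog."""
        self.backlog_s = max(0.0, self.backlog_s - elapsed_s)
```

Because the fictitious packet is only backlog and is never returned to the caller, it carries no reward, in line with the remark in the text.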
