Page 120 - Proceedings of the 2018 ITU Kaleidoscope

2018 ITU Kaleidoscope Academic Conference

mapping between state-action pairs and real numbers, that is, ℛ: S × A → ℝ.

After receiving some reward, which can be positive or negative, IoT nodes shift from one state to another, again depending on the previous state-action pair. These transitions can be stochastic, so that RL entities move from one state to another probabilistically. Formally, ℘ models the transition from one state to another by mapping a tuple (s, a, s′) to a real number representing the probability of transitioning from state s to state s′ after taking action a. That is, ℘: S × A × S → ℝ.

The goal of RL algorithms is to find an optimal action policy (π∗) that maximizes the expected total reward accumulated over some finite or infinite horizon. In the former case, the reward is aggregated over a fixed number of time units (e.g. seconds), whereas in the latter it is the average reward per unit of time that is maximized. Therefore, the objective of π∗ is, with the IoT node in a certain state s, to propose an action such that the total expected reward is maximized. Following the above formulation, an action policy can be mathematically represented as a mapping between states and actions, that is, π: S → A. This optimal policy can be implemented either as a tabular solution (i.e. for each state, a table stores the optimal action to take) or approximated by a function (i.e. a function takes states as inputs and returns actions as outputs). When the process under optimization is relatively complex, the number of potential different states (the space of S) is too large to be tabulated. Since function approximators are then the only feasible alternative, and given the recent successes of Artificial Neural Networks (ANN) in approximating functions, a plethora of ANN-based algorithms have recently emerged in the RL field. The basic idea is to have an ANN that, when fed with the current state of the RL entity (the IoT node in our case), returns the most promising action to follow.

Among all ANN-based alternatives, Evolution Strategies (ES) [21] has recently proved to be one of the best-suited alternatives to derive optimal policies, especially when the effects of the actions are long-lasting (that is, taking an action at a given instant t has a measurable, non-negligible effect at time t′, with t′ ≫ t). ES is a type of Genetic Algorithm [22], a black-box optimization meta-heuristic loosely inspired by natural selection. By iteratively tweaking the parameters of the ANN via natural selection, the modeled policy π tends to improve in proposing actions to take.

5. APPLICATION TO THE PROBLEM

Let an IoT network monitor a set of critical assets with some parameters of interest. Such a network is, in turn, composed of IoT nodes provided with different RATs that can be used to report certain detected events. Thus, having detected an event, an IoT node must decide whether to report it or not. If it chooses to report it, it also has to determine which RAT to use. Thus, the set of all allowed actions is composed of {a_0, a_1, …, a_N}, where N is the number of allowed RATs to use, and a_0 models the action of not reporting but dropping the event. Each of these generated events can have a different priority. For instance, if a malfunction in a life-supporting device is detected, a top-priority event must be sent. Conversely, if a regular event is detected (such as mild vibrations in an engine), a low-priority event could be generated. Priorities are allowed to vary in a range from 0 to 1, where 1 means top priority.

To model the nature of wireless communications, each RAT may have a limit on its usage. This can be due to two different reasons: (i) a limit on the total expenditure allowed, e.g. per day, derived from using such technology (for example, IoT nodes may not be allowed to spend more than 1$ a day when using 5G transmissions); or (ii) a limit on the traffic generated by any given technology, which can be expressed in bytes (e.g. per day) or in packets (for example, Sigfox nodes cannot generate more than 140 packets a day [23], or nodes making use of cellular technologies may not generate more than 1 Mb of traffic a day). Therefore, when action a_i (with i ≠ 0) is taken, the state of the IoT node changes, since the usage of technology i is also updated. When the usage of technology i reaches its limit, that technology is no longer available that day. Without any loss of generality, periods of 24 hours (1 day) are considered in limiting the usage of each RAT.

Furthermore, each action/RAT entails a different energy consumption. Since a single battery per node is assumed, if the battery level drops to zero, no further events can be reported. To complete the definition of the node state, the length of the generated packet (created as a response to an arising event) must be considered. It should be noted that the event-generation process is modeled as a Poisson process with a fixed average rate, expressed in events per second.

As commented in the Introduction, some LPWAN technologies, depending on the country, must undergo an enforced off-period after every transmission. To model this and, at the same time, packet buffering capabilities, individual infinite queues are assumed to exist for each RAT (that is, there exist N different queues in each node). Therefore, the transmission time of a packet depends not only on the length of the packet but also on the occupation of the queue of the chosen RAT. If an off-period is enforced in an IoT node as a result of a packet transmission, the LPWAN queue of that node is not only filled with such a packet, but also artificially extended with another fictitious packet that would take exactly the off-period to be transmitted. Note that this artificially generated packet has no impact on the obtained reward. Using this trick, we force nodes not to use the LPWAN RAT for at least the duration of the off-period, and thus to comply with regional regulations.

Finally, from a mathematical point of view, the state of a node is the vector formed by
