of the scheduled user are transmitted, while the buffers of the
other two users remain empty.

4.2 Possible inputs to RL agents

The inputs (also known as states or observations) can be selected
either from information provided in the CSV files (position
(x, y, z), velocities, etc.) or obtained from the environment, such
as the buffer state and the channel information for a specific beam
index previously chosen.
More specifically, the RL agent can use: the UE's geographic
position in X, Y, Z, with the origin of the coordinate system
placed at the BS (the RL agent); the UE's orientation in the three
rotation coordinates, namely the front and side roll angles as well
as its rotation around its own axis; the numbers of dropped,
transmitted and buffered packets; and, finally, the bit rate and
the channel magnitude at each step of the simulation.
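As a minimal illustration of how these features could be gathered
into an observation vector, consider the Python sketch below; the
ue and network objects and their field names are hypothetical
placeholders and do not correspond to the actual CAVIAR interface.

    import numpy as np

    def build_observation(ue, network):
        # Assemble the per-user feature vector described above.
        # Field names are illustrative, not the real CAVIAR API.
        return np.array([
            ue.x, ue.y, ue.z,               # position relative to the BS
            ue.roll, ue.pitch, ue.yaw,      # orientation angles
            network.dropped_packets,        # packets dropped so far
            network.transmitted_packets,    # packets transmitted so far
            network.buffered_packets,       # packets waiting in the buffer
            network.bit_rate,               # current bit rate
            network.channel_magnitude,      # |y_j| of the last chosen beam
        ], dtype=np.float32)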
Note that we assume the BS (more specifically, the RL agent) does
not know the best index $\hat{i}$. In practice, this would require
a full beam sweep, which is assumed to be unfeasible in our model
due to stringent time requirements. Similarly, given that the RL
agent chose user $u$ and beam index $j$ at time $t$, it learns only
the magnitude $|y_j|$ and the spectral efficiency $S_{u,t,j}$ for
this specific user and beam index at time $t$.
The channel throughput $T_{u,t,j} = S_{u,t,j} \, BW$ is obtained by
multiplying the spectral efficiency by the bandwidth $BW$ and
indicates the maximum number of bits that can be transmitted. An
empirical factor is used to adjust $T_{u,t,j}$ in order to define
the network load, such that, for the given input traffic, some
packets have to be dropped.
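A minimal sketch of this computation is shown below; the bandwidth
and the empirical load factor values are assumptions for
illustration only, not the parameters used in the paper.

    # Illustrative values only; not the parameters used in the paper.
    BW = 100e6           # bandwidth in Hz (assumed)
    LOAD_FACTOR = 0.8    # empirical adjustment factor (assumed)

    def channel_throughput(spectral_efficiency):
        # T_{u,t,j} = S_{u,t,j} * BW, scaled so that the offered
        # traffic can exceed capacity and some packets are dropped.
        return spectral_efficiency * BW * LOAD_FACTOR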
Algorithm 1 summarizes the steps for executing an already trained
RL agent.
Algorithm 1: High-level algorithm of the RL-based scheduling and
beam selection problem.

Initialization for a given episode $e$;
while $t \le N_e$ do
    1) Based on the number of bits in the buffers of the users and
       other input information, the RL agent schedules user $u$ and
       selects beam index $i$;
    2) The environment calculates the combined channel magnitude
       $|y_i|$ and the corresponding throughput $T_i$;
    3) The number of transmitted bits is $R_i = \min(T_{u,t,j}, b_u)$;
    4) Update buffers;
    5) Receive new packets;
    6) Eventually drop packets;
    7) The environment calculates rewards and updates its state;
    8) Update buffers again;
end
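To make the loop concrete, the Python sketch below mirrors the
steps of Algorithm 1 for one episode; the env and agent objects and
their method names are hypothetical and only illustrate the control
flow, not the actual CAVIAR implementation.

    def run_episode(env, agent, num_steps):
        # Hypothetical interfaces; one episode of an already trained agent.
        obs = env.reset()                            # initialization for episode e
        for t in range(num_steps):                   # while t <= N_e
            user, beam = agent.act(obs)              # 1) schedule user u, pick beam i
            magnitude, throughput = env.channel(user, beam)   # 2) |y_i| and T_i
            sent = min(throughput, env.buffer[user])          # 3) R_i = min(T, b_u)
            env.buffer[user] -= sent                 # 4) update buffers
            env.receive_new_packets()                # 5) new packets arrive
            env.drop_excess_packets()                # 6) drop packets if buffers overflow
            obs, reward = env.update_state()         # 7) reward and new state
            # 8) buffers are updated again inside update_state()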
4.3 Experiment description

We developed an experiment using CAVIAR for the problem of
scheduling and beam selection. Given that a complete episode file
contains information about all moving objects in a scene (all
pedestrians, cars, etc.), we simplified the data generated by the
simulation by assuming that the beam selection RL agent, named
B-RL, only uses data from the three served users (uav1,
simulation_car1 and simulation_pedestrian1).

Figure 7 – Channel maximum throughput when always using the best
beam index $\hat{i}$ and a simple scheduling strategy that chooses
users sequentially (1-2-3-1-2-3, ...), in a round-robin fashion.

The following results are extracted from an Advantage Actor Critic
(A2C) agent from Stable-Baselines [18], trained with default
parameters. The states of the agent are defined by seven features:
X, Y, Z, packets dropped, packets transmitted, buffered packets and
bit rate. The action space is composed of a vector with two
integers: the numeric identity of the user being allocated at the
specific timestamp, which ranges over [0, 2], and the codebook
index of the beam to be used to serve it, which is an integer in
the range [0, 63]. Finally, the reward is given by Eq. (5).
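As an illustration of these state and action definitions, the
spaces could be declared with Gym as in the sketch below; the
observation bounds are placeholders and this is not the actual
environment code.

    import numpy as np
    from gym import spaces

    # Seven observation features: X, Y, Z, packets dropped, packets
    # transmitted, buffered packets and bit rate (bounds are placeholders).
    observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                   shape=(7,), dtype=np.float32)

    # Two sub-actions: scheduled user in [0, 2] and beam index in [0, 63].
    action_space = spaces.MultiDiscrete([3, 64])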
Because the RL agent was designed to play the role of a simple
example and not to optimize performance, two other agents were
developed: B-Dummy and B-BeamOracle. The B-Dummy agent takes random
actions for both the scheduled user and the beam index to use. The
B-BeamOracle agent follows a sequential user scheduling pattern
(1-2-3-1-2-3, ...) in a round-robin fashion, and always uses the
optimum beam index $\hat{i}$ for the selected user. In Figure 7 we
characterize the channel maximum throughput of this experiment when
using B-BeamOracle.
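The two baselines can be sketched in a few lines of Python; the
function names and the best_beam_per_user lookup (assumed to come
from an exhaustive beam sweep available in simulation) are
illustrative assumptions, not the paper's actual implementation.

    import random

    def b_dummy_action():
        # Illustrative: random scheduled user and random beam index.
        return random.randint(0, 2), random.randint(0, 63)

    def b_beam_oracle_action(t, best_beam_per_user):
        # Round-robin user choice plus the oracle's best beam for that user.
        user = t % 3
        return user, best_beam_per_user[user]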
5. EXPERIMENT RESULTS

The CAVIAR environment was used to generate 70 episodes, from which
50 were used for training the RL agent and 20 for testing. We
present results for the three agents: B-Dummy, B-BeamOracle and the
RL-based A2C agent.

In Figure 8, it is possible to verify the switching at every 1000