Page 181 - Kaleidoscope Academic Conference Proceedings 2021


Connecting physical and virtual worlds

Window size: The length, in milliseconds, of a single window used during training. It determines how much data is processed per classification, so the trained model needs this much data to classify a sample. In this scenario, for instance, the wing beat of a mosquito is typically heard for about one second and is processed in windows of 180 milliseconds.

Window increase: If a training sample is longer than the window size, multiple windows are produced from it. (For example, the mosquito wing beat is heard for 5 seconds, while the window length is one second.) The step between the starts of successive windows is determined by the ‘window increase’. For a 5000ms long sample, with a window size of 180ms and a window increase of 1000ms, the windows will be: 0–180ms, 1000–1180ms, 2000–2180ms, 3000–3180ms, 4000–4180ms = 5 windows.

The second block is the Audio MFE block, which extracts time and frequency features from a signal. In the frequency domain, however, it uses a non–linear scale known as the Mel–scale. It works well with audio data, primarily for non–voice recognition applications where the sounds to be classified are easily distinguishable by the human ear[11]. The following images show the raw audio data of a mosquito’s one–second wing beat, together with the window size (Figure 4a) and the spectrogram of that window (Figure 4b).

Among the MFE block’s parameters are:[11]
• FFT length: The Fast Fourier Transform size.
• Low frequency: Lowest band edge of the Mel–scale filterbanks.
• High frequency: Highest band edge of the Mel–scale filterbanks.
• Window size: The size of the sliding window for local cepstral mean normalization, given as a number of samples.

To extract frequency bands, triangular filters are applied on a Mel–scale once the spectrogram is computed. The Filter number, Low frequency, and High frequency parameters determine the frequency range covered and the number of frequency features extracted. The Mel–scale is a perceptual scale of pitches that listeners judge to be equally spaced from one another. The aim is to extract more features (filter banks) at lower frequencies and fewer at higher frequencies, so it works best on sounds that the human ear can differentiate[11]. Figure 5 shows the Mel–filterbank energy features set for training this model.
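
As a rough illustration of how triangular filters can be placed on a Mel–scale, the following Python sketch spaces the filter edges equally in mels between a low and a high frequency. The Hz-to-mel formula used here is a common convention and is an assumption (the paper does not state one); the function names and the example settings (8 filters, 300 Hz to 8000 Hz) are likewise hypothetical and are not the values used for this model.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert Hz to mels (common HTK-style formula; assumed, not from the paper)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_filters: int, low_hz: float, high_hz: float) -> list:
    """Return num_filters + 2 edge frequencies (Hz), equally spaced in mels.

    Filter i spans edges[i] .. edges[i + 2] and peaks at edges[i + 1].
    """
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(num_filters + 2)]


def triangle_weight(freq_hz: float, left: float, center: float, right: float) -> float:
    """Weight of one triangular filter at freq_hz (0 outside the band, 1 at the peak)."""
    if left < freq_hz <= center:
        return (freq_hz - left) / (center - left)
    if center < freq_hz < right:
        return (right - freq_hz) / (right - center)
    return 0.0


# Hypothetical settings: filter number = 8, low frequency = 300 Hz, high = 8000 Hz.
NUM_FILTERS = 8
edges = mel_band_edges(NUM_FILTERS, 300.0, 8000.0)

# Width of each filter in Hz: narrow at low frequencies, wide at high ones.
bandwidths = [edges[i + 2] - edges[i] for i in range(NUM_FILTERS)]
```

Because the edges are equally spaced in mels, the filters are narrower in Hz at low frequencies than at high frequencies, which is the behaviour the text describes.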

(a) Raw audio sample

(b) Spectrogram

Figure 4 – Raw sample and spectrogram
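
The window size and window increase settings described earlier can be sketched in a few lines of Python. This is a minimal sketch under two assumptions: the window increase is taken as the offset between the starts of consecutive windows, and a trailing partial window is dropped. The function name `window_spans` is hypothetical and not part of any tool named in the paper.

```python
def window_spans(sample_ms: int, window_size_ms: int, window_increase_ms: int) -> list:
    """Return (start_ms, end_ms) for every full window that fits in the sample.

    Assumes window_increase_ms is the offset between consecutive window starts
    and that a trailing partial window is discarded.
    """
    spans = []
    start = 0
    while start + window_size_ms <= sample_ms:
        spans.append((start, start + window_size_ms))
        start += window_increase_ms
    return spans


# Settings from the text: 5000 ms sample, 180 ms window, 1000 ms window increase.
spans = window_spans(5000, 180, 1000)
for start_ms, end_ms in spans:
    print(f"{start_ms}-{end_ms} ms")
```

With these settings the sketch yields five windows, one starting at each whole second of the sample.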

Mel–filterbank energy features and their meaning:[11]
• Frame length: A time–frequency matrix is produced when a spectrogram is computed. Each time column of the matrix spans one frame length (specified in seconds; 20ms with the default configuration).
• Frame stride: The step between successive frames, in seconds. This is the same as ‘window increase’ above, but for the spectrogram’s time columns.
• Filter number: The number of triangular filters applied to the spectrogram.

Figure 5 – Mel–filterbank energy features

The plot of all the data in the data set is shown below (Figure 6). The MFE block’s spectrograms are fed into a neural network architecture, which is well suited to learning to detect patterns in data. The feature explorer displays a 3D representation of the whole data set, with each data item color–coded according to its label. All of the features from the audio files are reduced to only three and then grouped based on similarity; this is an effective approach to
