Qader et al. compare the performance of fault classification using K-Means, Fuzzy C-Means (FCM), and Expectation-Maximization (EM) [18]. They use data sets obtained from a network with heavy and light traffic scenarios in the routers and servers, and build a prototype to demonstrate network traffic fault classification under the given scenarios. The results show that FCM can achieve higher accuracy than K-Means and EM; however, it requires more time to process the data. The authors focus only on data related to the physical interface; thus, fault classification in a Network Function Virtualization (NFV) environment remains insufficiently researched.
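For illustration only, a minimal sketch of such a three-way clustering comparison could look as follows. This is not the pipeline of [18]: the synthetic feature matrix, the assumed number of fault classes, and the hand-rolled FCM loop (scikit-learn ships K-Means and EM, but no FCM) are all placeholder assumptions.

    # Rough sketch (not the authors' code) of a three-way clustering
    # comparison on a traffic-feature matrix X; the synthetic data, the
    # cluster count k, and the minimal FCM loop are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture  # EM for Gaussian mixtures

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))  # placeholder for real traffic features
    k = 3                          # assumed number of fault classes

    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    em_labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)

    def fuzzy_c_means(X, k, m=2.0, iters=100):
        """Minimal FCM: soft memberships U and centers V (no sklearn FCM)."""
        U = rng.random((len(X), k))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(iters):
            V = (U**m).T @ X / (U**m).sum(axis=0)[:, None]           # centers
            d = np.linalg.norm(X[:, None] - V[None], axis=2) + 1e-9  # distances
            w = d ** (-2.0 / (m - 1.0))                  # inverse-distance weights
            U = w / w.sum(axis=1, keepdims=True)         # membership update
        return U.argmax(axis=1)

    fcm_labels = fuzzy_c_means(X, k)

Reducing the soft FCM memberships with argmax yields crisp cluster labels, so all three methods can be scored against ground-truth fault classes in the same way.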
Recently, KDDI presented an ML comparison framework for network analysis [4]. It includes four functional blocks: a data set generator, a preprocessor, an ML-based fault classifier, and an evaluator. The data set generator can periodically generate failure data, which can be used in the ML-based fault classification task. They use three algorithms [Multilayer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM)] for training and evaluation. The results show that RF provides the highest performance even with a small amount of data, and that SVM can improve its performance through increased training data, feature reduction, or balance adjustment of normal/abnormal samples. However, the feature extraction method and training efficiency are not mentioned in their study. Training efficiency is an important metric for the evaluation of training models, and feature extraction is an essential step in achieving excellent performance with an ML method. Especially for a large amount of network log data, efficiently extracting useful information from the raw data allows a model to perform better in a much shorter training time.
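To make the shape of such a three-model comparison concrete, a hedged scikit-learn sketch is given below; the synthetic data, features, and hyperparameters are placeholder assumptions, not KDDI's actual framework or data.

    # Hedged sketch of a three-model comparison in the spirit of [4];
    # the imbalanced synthetic data stands in for normal/abnormal logs.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                               n_classes=3, weights=[0.8, 0.1, 0.1],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    models = {
        "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                             random_state=0),
        "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
        "SVM": SVC(kernel="rbf", class_weight="balanced", random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, f1_score(y_te, model.predict(X_te), average="macro"))

Here class_weight="balanced" stands in for the normal/abnormal balance adjustment mentioned above, and the macro-averaged F1 score keeps rare failure classes from being drowned out by the majority class.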
3. METHODOLOGY

This section introduces the data sets and shows how we extract features. Then we introduce several machine learning models used in this research.

3.1 Data preprocessing

This subsection explains how our comparison framework preprocesses Performance Management (PM) data before putting it into Machine Learning (ML) models for training.

Fig. 1 – Data collection principles [4].

As shown in Fig. 1, the data sets used for this study are created in an NFV-based test environment simulating a commercial IP core network. In this sense, the synthetic data is similar to real data, since it results from the NFV-based test environment. The data sets consist of labels of normal/abnormal traffic, performance monitoring data sets such as traffic volume and CPU/memory usage ratio, and route information such as Border Gateway Protocol (BGP) static metrics and BGP route information.

The data collector from KDDI collects and stores data sets from the network every minute. Once a failure is intentionally caused or recovered, the network indicates a failure or normal status after a transition period, corresponding to failure data (orange arrows) and recovery data (blue arrows) in Fig. 1. The time interval between a failure and a recovery is 5 minutes (Fig. 1). The data sets for training and evaluation provided by KDDI include four types, as listed in Table 1: Label-Failure Management, Virtual-Infrastructure, Physical-Infrastructure, and Network-Device.

Table 1 – Four types of data sets for learning and evaluation

Category    File Name                    Data Format
Label       Label-Failure Management     json
Log Data    Virtual-Infrastructure       json
Log Data    Physical-Infrastructure      json
Log Data    Network-Device               json

The training data set consists of 8 days of data, totaling approximately 120 GB of JSON files. The evaluation data set consists of 7 days of data, totaling about 100 GB of JSON files.
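Purely as an illustration of the 5-minute failure/recovery cycle described above (the column names and the alignment of cycles to 5-minute wall-clock boundaries are assumptions, not the KDDI format), the per-minute records can be grouped into cycles and reduced to their settled, last-minute samples:

    # Illustrative only: group per-minute records into 5-minute
    # failure-generation cycles and keep the last minute of each cycle,
    # i.e. the sample taken after the transition period has settled.
    import pandas as pd

    df = pd.DataFrame({
        "timestamp": pd.date_range("2021-01-01", periods=30, freq="min"),
        "cpu_usage": range(30),                      # placeholder metric
    })
    df["cycle"] = df["timestamp"].dt.floor("5min")   # one group per cycle
    settled = df.sort_values("timestamp").groupby("cycle").last()
    print(settled)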
3.1.1 Data collection and merging method

The JSON files' content is enormous, and most of the information is useless string description information. So we iterate through each object, looking for objects of numeric type. We extract these objects as features from the log files (in JSON format) and merge them with the labels into a CSV file based on time (Fig. 2).

We utilize paths like "key1/key1-1/key1-1-1..." as keys to extract features from the physical, virtual, and network JSON log files. For all log files, we also use the number of next-hops in each array and their prefixes as features.

Fig. 2 – Data merging principles
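A minimal sketch of this extraction-and-merge step is shown below; the toy record, field names, and label layout are illustrative assumptions, not the exact KDDI schema.

    # Walk a JSON object, keep only numeric leaves keyed by their path
    # ("key1/key1-1/..."), and record array lengths (e.g. next-hop counts).
    import json
    import pandas as pd

    def numeric_features(obj, path=""):
        feats = {}
        if isinstance(obj, dict):
            for key, val in obj.items():
                feats.update(numeric_features(val,
                                              f"{path}/{key}" if path else key))
        elif isinstance(obj, list):
            feats[f"{path}/len"] = len(obj)  # e.g. number of next-hops
            for i, val in enumerate(obj):
                feats.update(numeric_features(val, f"{path}/{i}"))
        elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
            feats[path] = obj                # numeric leaf becomes a feature
        return feats                         # string descriptions are dropped

    # Toy record standing in for one minute of a Network-Device log file.
    record = json.loads(
        '{"timestamp": "2021-01-01T00:00:00",'
        ' "cpu": {"usage": 42.0, "desc": "long text, dropped"},'
        ' "bgp": {"next_hops": [{"metric": 1}, {"metric": 2}]}}'
    )
    row = numeric_features(record)
    row["timestamp"] = record["timestamp"]        # keep the time key for the join

    logs = pd.DataFrame([row])
    labels = pd.DataFrame({"timestamp": ["2021-01-01T00:00:00"],
                           "label": ["normal"]})
    merged = logs.merge(labels, on="timestamp")   # join logs and labels on time
    merged.to_csv("features.csv", index=False)    # one CSV file, as in Fig. 2

Dropping string leaves during the walk discards the useless description text immediately, which keeps the merged CSV small relative to the raw JSON logs.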
3.1.2 Data differential method

As shown in Fig. 3, each failure generation cycle is 5 minutes. In a failure generation cycle, the last-minute data in the previous cycle is considered regular (normal) data, and the last-minute data in the current cycle is considered failure data. To highlight the differences between normal and