Page 201 - Kaleidoscope Academic Conference Proceedings 2020
PERT: PAYLOAD ENCODING REPRESENTATION FROM TRANSFORMER FOR
ENCRYPTED TRAFFIC CLASSIFICATION
Hong Ye He 1; Zhi Guo Yang 1; Xiang Ning Chen 1
1 Zhongxing Telecommunication Equipment (ZTE) Corporation
ABSTRACT

Traffic identification is becoming more important yet more challenging as encryption techniques develop rapidly. In contrast to recent deep learning methods that apply image processing to encrypted traffic problems, in this paper we propose a method named Payload Encoding Representation from Transformer (PERT) to perform automatic traffic feature extraction using a state-of-the-art dynamic word embedding technique. Based on this, we further provide a traffic classification framework in which unlabeled traffic is used to pre-train an encoding network that learns the contextual distribution of traffic payload bytes. The downstream classification then reuses the pre-trained network to obtain an enhanced classification result. Through experiments on a public encrypted traffic data set and our captured Android HTTPS traffic, we show that the proposed method is clearly more effective than the compared baselines. To the best of our knowledge, this is the first time encrypted traffic classification has been addressed with dynamic word embedding along with its pre-training strategy.

Keywords – Deep learning, dynamic word embedding, encrypted traffic classification, natural language processing, traffic identification

1. INTRODUCTION

Traffic classification, the task of identifying certain categories of network traffic, is crucial for Internet service providers (ISPs) to track the source of network traffic and to further ensure their quality of service (QoS). Traffic classification is also widely applied to specific tasks such as malware traffic identification and network attack detection. However, this is challenging because network traffic nowadays is more likely to be hidden by encryption techniques, making detection hard with traditional approaches.

Typically, the following traffic classification methods are widely applied: 1) The port-based method simply identifies traffic data using specific port numbers. It is susceptible to port number changes and port disguise. 2) Deep packet inspection (DPI), a method which aims to locate patterns and keywords in traffic packets, is not suitable for identifying encrypted traffic because it relies heavily on unencrypted information. 3) The machine learning (ML) based method focuses on using manually designed traffic statistical features to fit a machine learning model for categorization [1]. 4) The deep learning (DL) based method is an extension of the ML-based approach where neural networks are applied for automatic traffic feature extraction.

Although encrypted traffic packets are hard to identify, an encrypted traffic flow (a flow is a consecutive sequence of packets with the same source IP, source port, destination IP, destination port and protocol) is still analyzable because the first few packets of a flow may contain visible information such as handshake details [2]. For this reason, the ML-based and DL-based methods are considered ideal for encrypted traffic classification, since they both extract common features from the traffic data. In fact, the ML-based and DL-based methods share the same concept: traffic flows can be vectorized for supervised training according to their feature extraction strategy.

Rather than extracting hand-designed features from the traffic as the ML-based method does, the DL-based method uses a neural network to perform representation learning (RL) on the traffic bytes, which allows it to avoid complex feature engineering. It provides an end-to-end solution for encrypted traffic classification in which the direct relationship between raw traffic data and its categories is learned. The classification performance of a DL-based method is highly dependent on its capacity for representation learning.

In this paper, we propose a new DL-based solution named Payload Encoding Representation from Transformer (PERT), in which a dynamic word embedding technique called Bidirectional Encoder Representations from Transformers (BERT) [3] is applied during the traffic representation learning phase. Our work is inspired by the great improvements that dynamic word embedding has brought to the natural language processing (NLP) domain. We believe that computer communication protocols and natural language have some common characteristics. On this basis, we show that such a strong embedding technique can also be applied to encode traffic payload bytes and provide a substantial enhancement for the encrypted traffic classification task.
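
As an illustration of the flow definition used in the Introduction (a consecutive sequence of packets sharing source IP, source port, destination IP, destination port and protocol), the following minimal Python sketch groups packets into flows and turns the payload bytes of the first few packets into a token sequence. The Packet record, the function names, the choice of three packets and the one-token-per-byte hex encoding are all assumptions made for illustration only; they are not taken from the paper and do not describe the tokenization PERT itself uses.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record; field names are illustrative.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    payload: bytes


def group_into_flows(packets):
    """Group packets by their 5-tuple, keeping arrival order within a flow."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        flows[key].append(pkt)
    return dict(flows)


def flow_tokens(flow, first_n_packets=3, max_tokens=16):
    """Encode the payload bytes of the first few packets as hex tokens.

    One token per byte is an assumption of this sketch, not the paper's scheme.
    """
    data = b"".join(pkt.payload for pkt in flow[:first_n_packets])
    return [f"{byte:02x}" for byte in data[:max_tokens]]


packets = [
    Packet("10.0.0.2", 50000, "93.184.216.34", 443, "TCP", b"\x16\x03\x01"),
    Packet("10.0.0.3", 50001, "93.184.216.34", 443, "TCP", b"\x17\x03\x03"),
    Packet("10.0.0.2", 50000, "93.184.216.34", 443, "TCP", b"\x00\xff"),
]
flows = group_into_flows(packets)
first_key = ("10.0.0.2", 50000, "93.184.216.34", 443, "TCP")
print(len(flows))                     # 2
print(flow_tokens(flows[first_key]))  # ['16', '03', '01', '00', 'ff']
```

A token sequence of this kind is the sort of input that an embedding network could consume in place of words; restricting it to the first few packets reflects the observation that early packets of a flow may carry visible handshake information.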
978-92-61-31391-3/CFP2068P @ ITU 2020 – 143 – Kaleidoscope