Kaleidoscope Academic Conference Proceedings 2020

PERT: PAYLOAD ENCODING REPRESENTATION FROM TRANSFORMER FOR ENCRYPTED TRAFFIC CLASSIFICATION



Hong Ye He¹; Zhi Guo Yang¹; Xiang Ning Chen¹
¹ Zhongxing Telecommunication Equipment (ZTE) Corporation


ABSTRACT

Traffic identification is becoming more important yet more challenging as encryption techniques develop rapidly. In contrast to recent deep learning methods that apply image processing to encrypted traffic problems, in this paper we propose a method named Payload Encoding Representation from Transformer (PERT) that performs automatic traffic feature extraction using a state-of-the-art dynamic word embedding technique. Based on this, we further provide a traffic classification framework in which unlabeled traffic is used to pre-train an encoding network that learns the contextual distribution of traffic payload bytes. The downstream classification then reuses the pre-trained network to obtain an enhanced classification result. Through experiments on a public encrypted traffic data set and our own captured Android HTTPS traffic, we show that the proposed method is clearly more effective than the compared baselines. To the best of our knowledge, this is the first time that encrypted traffic classification has been addressed with dynamic word embedding along with its pre-training strategy.

Keywords – Deep learning, dynamic word embedding, encrypted traffic classification, natural language processing, traffic identification

1. INTRODUCTION

Traffic classification, the task of identifying certain categories of network traffic, is crucial for Internet service providers (ISPs) to track the source of network traffic and to further ensure their quality of service (QoS). Traffic classification is also widely applied in specific missions such as malware traffic identification and network attack detection. However, this is challenging because network traffic nowadays is more likely to be hidden by encryption techniques, which makes detection hard with traditional approaches.

Typically, the following traffic classification methods are widely applied: 1) The port-based method, which simply identifies traffic data using specific port numbers. It is susceptible to port number changes and port disguise. 2) Deep packet inspection (DPI), a method that aims to locate patterns and keywords in traffic packets. It is not suitable for identifying encrypted traffic because it relies heavily on unencrypted information. 3) The machine learning (ML) based method, which focuses on using manually designed traffic statistical features to fit a machine learning model for categorization [1]. 4) The deep learning (DL) based method, an extension of the ML-based approach in which neural networks are applied for automatic traffic feature extraction.

Although encrypted traffic packets are hard to identify, an encrypted traffic flow (a flow is a consecutive sequence of packets with the same source IP, source port, destination IP, destination port and protocol) is still analyzable because the first few packets of a flow may contain visible information such as handshake details [2]. For this reason, the ML-based and DL-based methods are considered ideal for encrypted traffic classification, since they both extract common features from the traffic data. In fact, the ML-based and DL-based methods share the same concept: traffic flows can be vectorized for supervised training according to their feature extraction strategy.

Rather than extracting hand-designed features from the traffic as the ML-based method does, the DL-based method uses a neural network to perform representation learning (RL) on the traffic bytes, which allows it to avoid complex feature engineering. It provides an end-to-end solution for encrypted traffic classification in which the direct relationship between raw traffic data and its categories is learned. The classification performance of a DL-based method is highly related to its capacity for representation learning.

In this paper, we propose a new DL-based solution named Payload Encoding Representation from Transformer (PERT), in which a dynamic word embedding technique called Bidirectional Encoder Representations from Transformers (BERT) [3] is applied during the traffic representation learning phase. Our work is inspired by the great improvements that dynamic word embedding has brought to the natural language processing (NLP) domain. We believe that computer communication protocols and natural language have some common characteristics. On this basis, we shall show that such a strong embedding technique can also be applied to encode traffic payload bytes and provide substantial enhancement in addressing the encrypted traffic classification task.
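
As an illustration of the flow definition used in the introduction (packets sharing the same source IP, source port, destination IP, destination port and protocol), the following minimal Python sketch groups packets into flows and turns the payload of the first few packets into integer byte tokens. The `Packet` record, the function names, and the `max_packets` and `max_bytes` limits are hypothetical choices made for this example; the byte-level tokenization is only an assumption and is not presented as the tokenization scheme used by PERT.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record (illustrative, not from the paper)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    payload: bytes


# A flow key is the 5-tuple named in the flow definition.
FlowKey = Tuple[str, int, str, int, str]


def group_into_flows(packets: List[Packet]) -> Dict[FlowKey, List[Packet]]:
    """Group packets by 5-tuple, keeping arrival order inside each flow.

    Assumption: the input list is already ordered by capture time, and
    no flow timeout is applied, so every packet with the same 5-tuple
    lands in the same flow.
    """
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        flows[key].append(pkt)
    return dict(flows)


def flow_to_tokens(flow: List[Packet],
                   max_packets: int = 3,
                   max_bytes: int = 16) -> List[List[int]]:
    """Convert the first packets of a flow into byte tokens in 0..255.

    Only the first `max_packets` packets are kept, since the early
    packets of a flow may carry visible information such as handshake
    details. Each payload is truncated to `max_bytes` bytes.
    """
    return [list(pkt.payload[:max_bytes]) for pkt in flow[:max_packets]]
```

A caller would pass a time-ordered packet list to `group_into_flows` and then feed each flow to `flow_to_tokens`; the resulting integer sequences are one simple way to vectorize a flow before handing it to a learned encoder.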





           978-92-61-31391-3/CFP2068P @ ITU 2020             – 143 –                                Kaleidoscope