
Figure 2 – Representation learning network for payload data using the dynamic word embedding architecture

The illustration of our representation learning is shown in Figure 2.

Figure 3 – Detail of the encoding layer

An earlier dynamic word embedding, the Embeddings from Language Models (ELMo) [12], uses a bidirectional LSTM as its encoder unit, which is not suitable for large-scale training since the LSTM supports parallel computation poorly. To solve this problem, [3] replaced the LSTM with the self-attention encoder first applied in the transformer model [13] and named their embedding model BERT. This is also what we use for encoding the encrypted payload, as shown in Figure 3. Taking our first embedding vectors [x_1, x_2, ..., x_k] as an example, the transformer encoding consists of the following steps:
Linear projections: Each embedding vector x_i will be projected to three vectors using linear transformations:

    Q_i = W^Q x_i,  K_i = W^K x_i,  V_i = W^V x_i    (1)

where W^Q, W^K and W^V are the three groups of linear parameters.
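
To make the projection step concrete, the following NumPy sketch (ours, not from the paper) applies equation (1) to a handful of embedding vectors; the sizes d_model and d_k are placeholder values, not the paper's settings.

    import numpy as np

    d_model, d_k = 64, 16                      # illustrative sizes, not the paper's settings
    rng = np.random.default_rng(0)

    # One learned projection matrix per role (query, key, value).
    W_Q = rng.normal(size=(d_k, d_model))
    W_K = rng.normal(size=(d_k, d_model))
    W_V = rng.normal(size=(d_k, d_model))

    def project(x):
        # Equation (1): map one embedding vector x_i to (Q_i, K_i, V_i).
        return W_Q @ x, W_K @ x, W_V @ x

    X = rng.normal(size=(5, d_model))          # five embedding vectors x_1..x_5
    Q, K, V = (np.stack(t) for t in zip(*(project(x) for x in X)))   # each (5, d_k)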
Self-attention and optional masking: The purpose of the linear projections is to generate the inputs for the self-attention mechanism. Generally speaking, self-attention figures out the compatibility between each input x_i and all the other inputs x_1~x_k via a similarity function, and then calculates a weighted sum for x_i that captures its overall contextual information. In detail, our self-attention is calculated as follows:

    att_i = \sum_{j=1}^{k} \frac{\exp(Q_i \cdot K_j / \sqrt{d_k})}{Z} V_j    (2)

The similarity between x_i and x_j is computed by a scaled dot-product operator, where d_k is the dimension of K_j and Z is the normalization factor. Note that not every input vector is needed for the self-attention calculation: an optional masking strategy that randomly ignores a few inputs while generating attention vectors is allowed, in order to avoid over-fitting.
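
A minimal sketch of equation (2), assuming Z is the usual softmax denominator over the unmasked positions; the mask_rate parameter is our own illustrative addition for the optional random masking, not a setting from the paper.

    import numpy as np

    def self_attention(Q, K, V, mask_rate=0.0, rng=None):
        # Equation (2): att_i = sum_j exp(Q_i.K_j / sqrt(d_k)) / Z * V_j.
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # scaled dot-products, shape (k, k)

        if rng is not None and mask_rate > 0.0:
            # Optional masking: randomly ignore a few inputs while attending.
            ignore = rng.random(K.shape[0]) < mask_rate
            if not ignore.all():                         # keep at least one input visible
                scores[:, ignore] = -np.inf

        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # Z: the normalization factor
        return weights @ V                               # one attention vector per input

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
    att = self_attention(Q, K, V, mask_rate=0.2, rng=rng)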

Multi-head attention: In order to grant encoders the ability to reach more contextual information, the transformer encoder applies a multi-head attention mechanism. Specifically, the linear projections are performed M times for each x_i to generate multiple attention vectors [att_{i,1}, att_{i,2}, ..., att_{i,M}]. Afterward, a concatenation operator is utilized to obtain the final attention vector:

    att_i = att_{i,1} \oplus att_{i,2} \oplus \cdots \oplus att_{i,M}    (3)
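
The sketch below (ours, with placeholder sizes) illustrates equation (3): M independent projection triples each produce one attention output, and the outputs are concatenated along the feature dimension.

    import numpy as np

    def single_head(X, W_Q, W_K, W_V):
        # One head: project X (eq. 1), then scaled dot-product attention (eq. 2).
        Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def multi_head(X, heads):
        # Equation (3): att_i = att_{i,1} (+) att_{i,2} (+) ... (+) att_{i,M}.
        return np.concatenate([single_head(X, *h) for h in heads], axis=-1)

    rng = np.random.default_rng(0)
    d_model, d_k, M = 64, 16, 4                          # illustrative sizes only
    heads = [tuple(rng.normal(size=(d_k, d_model)) for _ in range(3)) for _ in range(M)]
    X = rng.normal(size=(5, d_model))                    # embedding vectors x_1..x_5
    att = multi_head(X, heads)                           # shape (5, M * d_k)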
Feed-forward network (FFN): A fully connected network that provides the output of the current encoder. For x_i, it is computed as:

    h_i = \max(0, att_i W_1 + b_1) W_2 + b_2    (4)

W_1, b_1, W_2 and b_2 are the full-connection parameters, and max(0, x) is the ReLU activation.
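
Equation (4) is a two-layer fully connected network with a ReLU in between; a small sketch (ours, with placeholder dimensions) follows.

    import numpy as np

    def feed_forward(att, W1, b1, W2, b2):
        # Equation (4): h_i = max(0, att_i W1 + b1) W2 + b2 (ReLU followed by a linear layer).
        return np.maximum(0.0, att @ W1 + b1) @ W2 + b2

    rng = np.random.default_rng(0)
    d_att, d_ff, d_out = 64, 128, 64                     # illustrative sizes only
    W1, b1 = rng.normal(size=(d_att, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_out)), np.zeros(d_out)

    att = rng.normal(size=(5, d_att))                    # attention vectors from equation (3)
    H = feed_forward(att, W1, b1, W2, b2)                # dynamic embeddings h_1..h_5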
Finally, we obtain the dynamic embedding h_i, which is encoded from x_i. It can be further encoded by the next encoding layer or used directly in downstream tasks. Following the naming of BERT, we call our encoding network the Payload Encoding Representation from Transformer (PERT), reflecting its use of a transformer encoder.
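
Putting the four steps together, the sketch below (ours, with placeholder sizes) stacks a few such encoding layers, feeding the output of one layer into the next; the residual connections and layer normalization of the original transformer, and the actual layer widths, are not specified in the text above and are therefore omitted or assumed here.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k, M = 64, 16, 4                          # illustrative sizes only

    def encoder_layer(X, heads, W1, b1, W2, b2):
        # One encoding layer: equations (1)-(4) in sequence.
        outputs = []
        for W_Q, W_K, W_V in heads:                      # eq. (1), repeated M times
            Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
            scores = Q @ K.T / np.sqrt(d_k)              # eq. (2)
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            outputs.append(w @ V)
        att = np.concatenate(outputs, axis=-1)           # eq. (3)
        return np.maximum(0.0, att @ W1 + b1) @ W2 + b2  # eq. (4)

    def make_layer_params():
        heads = [tuple(rng.normal(size=(d_k, d_model)) for _ in range(3)) for _ in range(M)]
        W1, b1 = rng.normal(size=(M * d_k, 4 * d_model)), np.zeros(4 * d_model)
        W2, b2 = rng.normal(size=(4 * d_model, d_model)), np.zeros(d_model)
        return heads, W1, b1, W2, b2

    H = rng.normal(size=(5, d_model))                    # initial embeddings x_1..x_k
    for params in (make_layer_params() for _ in range(3)):
        H = encoder_layer(H, *params)                    # each layer re-encodes the previous output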
3.3   Packet-level Pre-training

A key factor that makes BERT and its extensive models continuously achieve state-of-the-art results among a wide



