Page 38 - First special issue on The impact of Artificial Intelligence on communication networks and services

ITU JOURNAL: ICT Discoveries, Vol. [...], March [...]























4.3. Hardware architecture design

Figure 10. Our CPU+FPGA system architecture.

Figure 11. Evaluation results of YOLO-tiny on mobile GPUs and different FPGA platforms.

Table 4. Evaluation results of SSD on CPU, GPU and FPGA platforms.

Platform             Intel Xeon E5-2640 v4   NVIDIA GTX 1080TI GPU   Xilinx ZU9 FPGA   Xilinx ZU9 FPGA
Task                 SSD (YOLO)              SSD (YOLO)              SSD (YOLO)        SSD (YOLO) Pruned
Operations (GOPs)    16.6                    16.6                    16.6              7.4
fps                  4.88                    183.48                  9.09              20.00
Power (W)            90                      250                     14                14
Efficiency (fps/W)   0.054                   0.734                   0.649             1.429
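
The efficiency row of Table 4 is the frame rate divided by the power draw. As a quick check, the sketch below recomputes that row from the fps and power rows. The dictionary keys and the helper name `efficiency` are illustrative labels chosen here, not names from the paper, and the 14 W figure is assumed to apply to both ZU9 configurations.

```python
# Recompute the Efficiency (fps/W) row of Table 4 from its fps and Power rows.
# Keys and function names are illustrative; the numbers are copied from the table.
# Assumption: the single 14 W entry covers both the unpruned and pruned ZU9 runs.

TABLE_4 = {
    "Xeon E5-2640 v4": {"fps": 4.88, "power_w": 90},
    "GTX 1080TI": {"fps": 183.48, "power_w": 250},
    "ZU9": {"fps": 9.09, "power_w": 14},
    "ZU9 pruned": {"fps": 20.00, "power_w": 14},
}


def efficiency(fps: float, power_w: float) -> float:
    """Frames per second delivered per watt of power."""
    return fps / power_w


EFFICIENCY = {name: efficiency(**row) for name, row in TABLE_4.items()}

for name, value in EFFICIENCY.items():
    print(f"{name:>16}: {value:.3f} fps/W")

# Ratio behind the "almost twice" comparison made in the text.
PRUNED_VS_GPU = EFFICIENCY["ZU9 pruned"] / EFFICIENCY["GTX 1080TI"]
print(f"pruned ZU9 vs 1080TI: {PRUNED_VS_GPU:.2f}x")
```

Run as written, this reproduces the four efficiency values to three decimals and gives a pruned-ZU9 to 1080TI ratio of about 1.95.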
Our Aristotle hardware architecture design [74] is shown in Fig. 10. We adopt a CPU+FPGA accelerator design that consists of two parts: the processing system (PS) and the programmable logic (PL). The PS contains the low-power CPU processors and the external memory, which provide programmability and data capacity. Instructions are transferred into the PL and decoded to control it. The PL is the on-chip design where most of the CNN accelerator logic is located, and it can be scaled to fit the chosen FPGA platform. PEs are placed inside the PL for parallel MAC operations and complete the convolution process through multiple iterations. Some functions that cannot be efficiently accelerated with PEs, such as several kinds of pooling and an element-wise dot product, are contained in a MISC calculation pool for optional use. On-chip buffers offer data reuse opportunities controlled by a scheduler, and communicate with external memories through a data mover such as a direct memory access controller (DMAC). Such a hardware architecture can be easily shared between layers, which is friendly to instruction generation and high-level programming.

Instead of combining every multiplication of one filter window together, we split the computing kernel into a smaller granularity. This avoids wasting arithmetic resources when dealing with a large filter size or window stride, and ensures a regular data access pattern for easier control. Furthermore, a smaller PE granularity increases the chance of skipping computation for sparsity, which reduces the overall execution time of the system.

4.4. Performance evaluation

We use the YOLO algorithm, the most popular real-time detection algorithm in the RTAV area, to evaluate our Aristotle system. Fig. 11 compares performance on different platforms. Compared with mobile GPU platforms of the same level, our system reaches similar performance. However, the power consumption of our Zynq-7020 and ZU2 based systems is around 3 W, while that of the GPU is 15 W. Moreover, the peak performance of the TK1 is 326 GOPS and that of the TX1 is 1 TOPS, while the peak performance of our FPGA platforms is only around 100 GOPS. These results indicate a much better efficiency of our system design.

We also use the YOLO version of the SSD [75] algorithm to compare our larger FPGA systems with CPUs and GPUs. SSD is an optimized algorithm based on YOLO with multi-scale feature extraction, which improves the ability to capture small objects. Table 4 lists the results on different platforms. Both the GPU and FPGA solutions run faster than the Intel Xeon CPU. The power consumption of the NVIDIA GTX 1080TI GPU can reach 250 W, while that of the FPGA is only 14 W. In terms of efficiency, with the pruning method implemented, our design achieves almost twice the efficiency of the 1080TI GPU.

Furthermore, we have tested a Densebox [76] model on our




© International Telecommunication Union