Page 38 - First special issue on The impact of Artificial Intelligence on communication networks and services
ITU Journal: ICT Discoveries
Figure 11. Evaluation results of YOLO-tiny on mobile GPUs and different FPGA platforms.

Figure 10. Our CPU+FPGA system architecture.

Table 4. Evaluation results of SSD on CPU, GPU and FPGA platforms. Task: pruned SSD (YOLO); operations: 16.6 (7.4) GOPs. Values in parentheses refer to YOLO.

Platform           | Intel Xeon E5-2640 v4 | NVIDIA GTX 1080TI GPU | Xilinx ZU9 FPGA
fps                | 4.88                  | 183.48                | 9.09 (20.00)
Power (W)          | 90                    | 250                   | 14
Efficiency (fps/W) | 0.054                 | 0.734                 | 0.649 (1.429)
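The efficiency column of Table 4 is simply achieved throughput divided by board power. A minimal sketch, using only the fps and power figures quoted in the table (the dictionary keys are labels chosen here for illustration), reproduces that column and the roughly 2x FPGA-over-GPU gap discussed in Section 4.4:

```python
# Reproduce the Efficiency (fps/W) column of Table 4:
# efficiency = achieved frame rate / platform power.
platforms = {
    "Intel Xeon E5-2640 v4":    {"fps": 4.88,   "power_w": 90},
    "NVIDIA GTX 1080TI":        {"fps": 183.48, "power_w": 250},
    "Xilinx ZU9 (pruned YOLO)": {"fps": 20.00,  "power_w": 14},
}

efficiency = {name: p["fps"] / p["power_w"] for name, p in platforms.items()}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.3f} fps/W")

# FPGA advantage over the GPU in fps/W terms
gain = efficiency["Xilinx ZU9 (pruned YOLO)"] / efficiency["NVIDIA GTX 1080TI"]
print(f"FPGA vs. GPU efficiency gain: {gain:.2f}x")
```

Note that the comparison deliberately uses fps/W rather than peak GOPS/W: the GPUs have far higher peak throughput, but the delivered frames per watt is what the table ranks.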
4.3. Hardware architecture design

Our Aristotle hardware architecture design [74] is shown in Fig. 10. It adopts a CPU+FPGA accelerator design consisting of two parts: the processing system (PS) and the programmable logic (PL). The PS contains the low-power CPU processors and the external memory, which offer programmability and data capacity. Instructions are transferred into the PL and decoded there to control it. The PL is the on-chip design where the majority of the CNN accelerator logic is located, and it can be scaled according to the chosen FPGA platform. PEs are placed inside the PL for parallel MAC operations, and they complete the convolution process through multiple iterations. Functions that cannot be efficiently accelerated with the PEs, such as several kinds of pooling and element-wise dot products, are contained in a MISC calculation pool for optional use. On-chip buffers provide data reuse opportunities under the control of a scheduler, and communicate with external memory through a data mover such as a direct memory access controller (DMAC). Such a hardware architecture can easily be shared between layers, which is friendly to instruction generation and high-level programming.

Instead of combining every multiplication of one filter window together, we split the computing kernel into a smaller granularity. This avoids wasting arithmetic resources when dealing with a large filter size or window stride, and ensures a regular data access pattern that is easier to control. Furthermore, a smaller PE granularity increases the chance of skipping zeros in sparse data, which saves overall execution time of the system.

4.4. Performance evaluation

We use the YOLO algorithm, the most popular real-time detection algorithm in the RTAV area, to evaluate our Aristotle system. Fig. 11 compares its performance on different platforms. Compared with mobile GPU platforms of the same class, our system reaches similar performance. However, the power consumption of our Zynq-7020 and ZU2 based systems is around 3 W, while that of the GPUs is 15 W. Moreover, the peak performance of the TK1 is 326 GOPS and that of the TX1 is 1 TOPS, while the peak performance of our FPGA platforms is only around 100 GOPS. This demonstrates the much higher efficiency of our system design.

We also use the YOLO-based SSD [75] algorithm to compare our larger FPGA systems with CPUs and GPUs. SSD is an optimized algorithm based on YOLO whose multi-scale feature extraction improves the ability to capture small objects. Table 4 lists the results on the different platforms. Both the GPU and FPGA solutions run faster than the Intel Xeon CPU. However, the power consumption of the NVIDIA GTX 1080TI GPU reaches up to 250 W, while that of the FPGA is only 14 W. In terms of efficiency, with the pruning method implemented, our design achieves almost twice the efficiency of the 1080TI GPU.

Furthermore, we have tested a Densebox [76] model on our
International Telecommunication Union