Figure 8. The software-hardware co-design workflow of our system.


... software. With this motivation, we will introduce our system design in the following section.

4. SOFTWARE-HARDWARE CO-DESIGN FOR A RECONFIGURABLE AUTONOMOUS VISION SYSTEM

4.1. The overall system workflow

We have already built an FPGA-based system called Aristotle that targets CNN acceleration; it can handle various CNN-based applications and can be conveniently mapped onto different FPGA platforms. To achieve better processing performance, we should both reduce the software workload and improve the hardware utilization rate. Accordingly, we designed the software-hardware co-design workflow of our Aristotle system depicted in Fig. 8. To reduce the workload, we compress the models with software methods such as quantization, pruning, and matrix transformation. To improve the utilization rate, the compiler takes the compressed model and the hardware parameters of different FPGA platforms as inputs, and performs task tiling with dataflow optimizations to generate instructions for the hardware. The hardware architecture then exploits on-chip parallelism for higher throughput through proper granularity choices and datapath reuse. The details are introduced as follows.
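To make the compiler's task-tiling step concrete, here is a minimal sketch of how a tile size could be chosen from hardware parameters. The function name, parameters, and the simple working-set cost model are our illustrative assumptions, not the actual Aristotle compiler interface.

    # Illustrative sketch of task tiling: pick the largest output tile whose
    # working set fits in the on-chip buffer of a given FPGA platform.
    def tile_conv_layer(h, w, c_in, c_out, k, buf_bytes, bytes_per_datum=1):
        best = None
        for t in range(1, min(h, w) + 1):
            in_tile = (t + k - 1) ** 2 * c_in    # input patch incl. (k-1) halo
            weights = k * k * c_in * c_out       # filter weights for the layer
            out_tile = t * t * c_out             # output patch
            if (in_tile + weights + out_tile) * bytes_per_datum <= buf_bytes:
                best = t                         # keep the largest feasible tile
        return best

    # Example: 224x224 input with 3 channels, 64 filters of size 3x3,
    # and a hypothetical 512 KiB on-chip buffer holding 8-bit data.
    t = tile_conv_layer(224, 224, 3, 64, 3, buf_bytes=512 * 1024)
    print(f"chosen tile: {t}x{t}")               # the hardware loops over the tiles

The compiler would then emit one block of load/compute/store instructions per tile, which is where the dataflow optimizations (e.g., reusing loaded data across adjacent tiles) come into play.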
4.2. Compression methods

An algorithm model is usually trained in floating-point form, but this representation contains redundancy. Previous work has shown that it is not necessary to represent every datum with 32 bits, and that an appropriate data quantization does not hurt the overall accuracy of the model. In Fig. 9 we report a quantization experiment on state-of-the-art CNN models; as the results show, 8-bit quantization brings little loss of accuracy.

Figure 9. Quantization results for different CNN models.
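As a concrete illustration of such fixed-point quantization, the sketch below maps a floating-point weight tensor to 8-bit signed integers with a power-of-two scale. This is a generic scheme shown for illustration; the exact quantizer used in our system may differ.

    import numpy as np

    def quantize(w, bits=8):
        """Symmetric linear quantization to signed `bits`-bit integers,
        using a power-of-two scale so dequantization is a cheap shift."""
        qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
        frac_bits = int(np.floor(np.log2(qmax / np.abs(w).max())))
        scale = 2.0 ** frac_bits
        q = np.clip(np.round(w * scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale                             # w is approximated by q / scale

    w = 0.1 * np.random.randn(3, 3, 64, 64).astype(np.float32)
    q, s = quantize(w)
    print("max abs error:", float(np.abs(w - q / s).max()))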
A lower bit-width directly shrinks the memory footprint, and it also creates the chance to share a datapath built from the integrated DSP blocks: on Xilinx platforms, we can implement two multipliers for 8-bit inputs with a single 25 × 18 DSP block.
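The two-multipliers-per-DSP idea can be shown in plain arithmetic: two products that share one operand are packed into a single wide multiplication by shifting one factor into the upper bits of the wide port. The demonstration below uses unsigned 8-bit values for clarity; real signed packing on a DSP block needs additional correction logic that we omit here.

    # Two 8-bit products a*c and b*c, sharing operand c, from ONE wide multiply.
    a, b, c = 117, 42, 93          # unsigned 8-bit values
    packed = (a << 16) | b         # a in the upper bits: fits the 25-bit port
    wide = packed * c              # the single hardware multiplication
    b_times_c = wide & 0xFFFF      # low 16 bits hold b*c (b*c < 2**16, no carry)
    a_times_c = wide >> 16         # upper bits hold a*c
    assert a_times_c == a * c and b_times_c == b * c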
Another method is to apply a pruning process to the pre-trained model in order to decrease the number of connections inside it [71]. It has been proved that connections whose weights are close to zero have only a small impact on the output pixels, so they can be pruned without much loss, and the loss can be further healed by retraining. Table 3 shows that if we combine pruning and quantization, the compressed model size is the smallest, with negligible accuracy loss. Together with Huffman coding, the model size of AlexNet can be reduced by 35 times, and that of VGG-16 by 49 times. Note, however, that the sparsity produced by pruning is randomly distributed, which makes it hard to exploit efficiently in hardware. To deal with this, we add constraints that limit the pruned connections to regular patterns; this increases the number of all-zero channels and enables more skips during the acceleration process.
Table 3. Comparison of compression ratios between quantization, pruning, and matrix transformation (SVD) methods at different accuracy loss levels (baseline: 32-bit floating point).

    Accuracy Loss    SVD     Quantization Only    Pruning Only    Quantization and Pruning
    0%               -       5.8x                 10.3x           27.0x
    1%               5.4x    14.1x                15.6x           35.7x
    2%               6.5x    14.9x                19.1x           37.0x
    4%               6.9x    15.4x                22.9x           37.7x
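A minimal sketch of the pruning idea, including the regularity constraint described above: instead of zeroing individual weights wherever their magnitude is small, we remove whole input channels with the smallest L1 norms, so the accelerator can skip entire channels. The keep ratio and the channel-level granularity are illustrative choices, not our system's exact settings.

    import numpy as np

    def prune_channels(w, keep_ratio=0.5):
        """Structured magnitude pruning on a (k, k, c_in, c_out) weight tensor:
        zero out the input channels with the smallest L1 norms."""
        norms = np.abs(w).sum(axis=(0, 1, 3))      # one L1 norm per input channel
        n_keep = max(1, int(keep_ratio * norms.size))
        mask = np.zeros(norms.size, dtype=bool)
        mask[np.argsort(norms)[-n_keep:]] = True   # keep the strongest channels
        return w * mask[None, None, :, None]       # all-zero channels -> skippable

    w = np.random.randn(3, 3, 64, 128).astype(np.float32)
    w_pruned = prune_channels(w, keep_ratio=0.25)
    zeroed = (np.abs(w_pruned).sum(axis=(0, 1, 3)) == 0).sum()
    print("zeroed channels:", int(zeroed))         # retraining then heals the loss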
Moreover, inside the basic MAC operations of a CNN, multiplication is the most resource-consuming operation, so reducing the number of multiplications can also enhance hardware performance. Matrix transformations such as Winograd [72] and FFT [73] achieve this goal and target different filter sizes. Taking the Winograd transformation as an example: if we tile the input feature maps into 6 × 6 blocks and convolve them with 3 × 3 filters, the transformation reduces the number of multiplications by 2.25 times and replaces them with cheap addition and shifting operations.
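The arithmetic saving is easiest to see in one dimension. The F(2, 3) algorithm below produces two outputs of a 3-tap filter with 4 multiplications instead of the direct method's 6; nesting the same construction in two dimensions gives F(2×2, 3×3), which needs 16 multiplications per tile instead of 36, i.e., the 2.25x reduction mentioned above. This is the textbook formulation from [72], shown only for illustration.

    def winograd_f23(d, g):
        """F(2,3): two outputs of a 3-tap filter from 4 inputs, 4 multiplies."""
        g0, g1, g2 = g
        # filter transform: computed once per filter, so its cost is amortized
        u0, u1, u2, u3 = g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2
        m0 = (d[0] - d[2]) * u0        # the 4 multiplications
        m1 = (d[1] + d[2]) * u1
        m2 = (d[2] - d[1]) * u2
        m3 = (d[1] - d[3]) * u3
        return [m0 + m1 + m2, m1 - m2 - m3]

    d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
    direct = [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
    assert winograd_f23(d, g) == direct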
With all the compression methods above, we can reduce the workload of the original model, which benefits the on-chip memory, the arithmetic resources, and the system throughput.