Figure 8. The software-hardware co-design workflow of our system.
software. With this motivation, we will introduce our system design in the following section.
Figure 9. Quantization results for different CNN models.
4. SOFTWARE-HARDWARE CO-DESIGN FOR A RECONFIGURABLE AUTONOMOUS VISION SYSTEM
4.1. The overall system workflow

Table 3. Comparison of compression ratios between quantization, pruning and matrix transformation (SVD) methods at different accuracy loss levels (baseline: 32-bit floating-point).

Accuracy Loss | SVD  | Quantization Only | Pruning Only | Quantization and Pruning
0%            | -    | 5.8x              | 10.3x        | 27.0x
1%            | 5.4x | 14.1x             | 15.6x        | 35.7x
2%            | 6.5x | 14.9x             | 19.1x        | 37.0x
4%            | 6.9x | 15.4x             | 22.9x        | 37.7x

What we have already achieved is an FPGA-based system called Aristotle that targets CNN acceleration; it can deal with various CNN-based applications and can be conveniently mapped onto different FPGA platforms. For better processing performance, we should reduce the software workload and improve the hardware utilization rate. Accordingly, we design the software-hardware co-design workflow of our Aristotle system depicted in Fig. 8. To reduce the workload, we compress the models using software methods like quantization, pruning and matrix transformation. To improve the utilization rate, the compiler will take the compressed model and the hardware parameters of different FPGA platforms as inputs, and execute task tiling with dataflow optimizations to generate instructions for the hardware. The hardware architecture will exploit on-chip parallelism for higher throughput with a proper granularity choice and datapath reuse. The details will be introduced as follows.
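For intuition, the flow in Fig. 8 can be sketched in a few lines of Python; every name below (compress, compile_model, buffer_size) is a hypothetical placeholder for illustration, not the actual Aristotle toolchain interface.

# Hypothetical sketch of the Fig. 8 co-design flow, not the real toolchain.
def compress(model):
    # software stage: quantization, pruning, matrix transformation
    return {"weights": model["weights"], "bits": 8}

def compile_model(compressed, hw_params):
    # compiler stage: tile the task to fit the on-chip buffers, then
    # emit instructions for the accelerator
    tile = min(hw_params["buffer_size"], len(compressed["weights"]))
    return [("LOAD", tile), ("CONV", tile), ("SAVE", tile)]

instructions = compile_model(compress({"weights": [0.1] * 1024}),
                             {"buffer_size": 256})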
4.2. Compression methods

Usually, an algorithm model is trained in floating-point form, but there exists redundancy.
Previous work has shown that it is not necessary to represent every datum with 32 bits, and that an appropriate data quantization will not hurt the overall accuracy of the model. In Fig. 9 we show a quantization experiment on state-of-the-art CNN models; as we can see, an 8-bit quantization brings little loss of accuracy.
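To make the idea concrete, the following is a minimal NumPy sketch of one common symmetric linear 8-bit quantization scheme; the exact scheme behind Fig. 9 is not specified here, so this is an assumption for illustration.

import numpy as np

def quantize_int8(w):
    # map the floating-point weight range onto signed 8-bit integers
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale   # dequantize to inspect the error
print(np.abs(w - w_hat).max())         # small relative to the weight range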
A lower bit-width can directly compress the memory footprint, and brings the chance to share a datapath that consists of integrated DSP blocks. We can implement two multipliers for 8-bit inputs with one 25 × 18 DSP block on Xilinx platforms.
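The packing trick behind this claim can be shown with plain integer arithmetic: two narrow products share one wide multiplier as long as their bit fields do not overlap. The sketch below assumes unsigned 8-bit operands for simplicity; signed operands would need extra correction terms.

# Two 8-bit multiplications by a shared weight w, computed with one
# wide multiplication, as when packing two multipliers into one DSP block.
a, b, w = 200, 57, 131
p = ((a << 16) + b) * w      # single wide multiply
assert p & 0xFFFF == b * w   # low 16-bit field holds b*w (255*255 < 2**16)
assert p >> 16 == a * w      # high field holds a*w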
Another method is to apply a pruning process to the pre-trained model, in order to decrease the number of connections inside the model [71]. It has been proved that connections whose weights are close to zero make a small impact on the output pixel, and can be pruned without much loss; the loss can be further healed by retraining. Table 3 has shown that if we combine pruning and quantization together, the compressed model size is the smallest, with negligible accuracy loss. Together with Huffman coding, the model size of AlexNet can be reduced by 35 times, and that of VGG-16 by 49 times. We should notice, however, the randomness of the sparsity left by pruning, which is tough to exploit efficiently in hardware execution. To deal with this, we add constraints that limit the pruned connections to regular patterns; this increases the number of all-zero channels and allows more skips during the acceleration process.
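A minimal sketch of magnitude pruning under one possible regularity constraint, here zeroing whole input channels so the hardware can skip them; the actual constraint patterns used in our system may differ.

import numpy as np

def prune_channels(w, ratio=0.5):
    # w: (out_ch, in_ch, k, k) convolution weights
    norms = np.abs(w).sum(axis=(0, 2, 3))   # importance of each input channel
    idx = np.argsort(norms)[:int(len(norms) * ratio)]  # weakest channels
    w = w.copy()
    w[:, idx] = 0.0                          # all-zero channels can be skipped
    return w

w = np.random.randn(32, 16, 3, 3).astype(np.float32)
pruned = prune_channels(w)
print((np.abs(pruned).sum(axis=(0, 2, 3)) == 0).sum())  # 8 all-zero channels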
Moreover, we can see that inside the basic MAC operations of a CNN, multiplication is always the most resource-consuming operation, so reducing the number of multiplications can also enhance the hardware performance. Matrix transformations like Winograd [72] and FFT [73] can achieve this goal by targeting different filter sizes. Taking the Winograd transformation as an example, if we tile the input feature maps into 6 × 6 blocks and convolve them with 3 × 3 filters, the transformation reduces the number of multiplications by 2.25 times, replacing them with cheap add and shift operations.
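For the 1-D case F(2,3) the transformation can be written out directly: two outputs of a 3-tap convolution cost four multiplications instead of six, and the divisions by 2 fold into the filter transform, which is computed once offline. The tiled 2-D case quoted above follows the same idea.

def winograd_f23(d, g):
    # d: 4 input samples, g: 3 filter taps -> 2 outputs, 4 multiplications
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 0.25, -1.0]
direct = [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
          d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]
assert winograd_f23(d, g) == direct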
With all these compression methods above, we can reduce the workload of the original model, which will benefit the on-chip memory, the arithmetic resources and the overall system throughput.