Page 35 - First special issue on The impact of Artificial Intelligence on communication networks and services

ITU JOURNAL: ICT Discoveries, Vol.        March

























Figure 7. Hardware designs of CNN accelerators on different platforms and the development route for RTAV accelerators in ADAS.
(Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/)
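
The discussion of Fig. 7 in this section compares platforms by throughput, power, and energy efficiency, where efficiency is throughput divided by power (GOPS/W). As a minimal sketch of that arithmetic, the Python snippet below recomputes GOPS/W from the GPU figures quoted in this section. The helper name `efficiency_gops_per_watt` is hypothetical, and pairing the 10 W figure with the TX1's VGG-16 throughput is an assumption made for illustration only.

```python
def efficiency_gops_per_watt(throughput_gops: float, power_watts: float) -> float:
    """Energy efficiency as throughput divided by power, in GOPS/W."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return throughput_gops / power_watts


# (throughput in GOPS, power in W), taken from the figures quoted in the text.
# 1 TOPS = 1000 GOPS. Using 10 W for the VGG-16 run is an assumption.
quoted = {
    "Tesla V100 (peak)": (120_000.0, 300.0),
    "Jetson TX1 (peak)": (1_000.0, 10.0),
    "Jetson TX1 (VGG-16)": (300.0, 10.0),
}

efficiencies = {
    name: efficiency_gops_per_watt(gops, watts)
    for name, (gops, watts) in quoted.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.0f} GOPS/W")
```

Under these assumptions the peak figures work out to 400 GOPS/W for the V100 and 100 GOPS/W for the TX1, while the TX1's VGG-16 throughput gives 30 GOPS/W.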

Reference [51] compares neural network accelerators, as depicted in Fig. 7. GPUs are in the top tier for computing speed, but their power consumption is also very high. The recently released NVIDIA Tesla V100 reaches a computing speed of 120 TOPS [52] at a power consumption of 300 W. This is useful in datacenter scenarios such as model training, where power is not the main concern. Some GPUs are designed for low-power embedded environments, such as the NVIDIA Jetson TX1 mobile GPU, which delivers 300 GOPS on VGG-16 and a peak performance of 1 TOPS at a cost of only 10 W [53]. The large general-purpose stream processors on chip may bring considerable parallelism, but efficiency remains a question. In 28 nm technology, the NVIDIA TITAN-X and TX1 GPUs reach an efficiency of only 20-100 GOPS/W.

To improve efficiency, we need to customize the internal logic of the processing elements (PEs) to enhance processing parallelism and optimize memory access patterns. An FPGA can be a suitable initial choice, since it provides a large amount of computing and memory resources and, through its programmable interconnect, enough reconfigurability to map common algorithms onto. Fig. 7 shows that there have been many FPGA designs. The top designs, including our Aristotle system on the Xilinx ZU9 FPGA platform, reach a throughput of around 2 TOPS, which is quite close to that of NVIDIA TITAN-X GPUs of the same technology generation, but with almost 10 times better efficiency. This shows that FPGAs are strong competitors.

Most CNN layers consist of MAC operations and have similar computing patterns, which can be generalized and parameterized. Therefore, with a mature hardware architecture and processing flow, it is feasible to harden the original FPGA accelerator design into an ASIC chip with a programmable interface for reconfigurability, which can further improve performance. Kuon et al. [54] measured the performance gap between FPGAs and ASICs. They report that the critical-path delay of an ASIC is three to four times less than that of an FPGA, that the FPGA-to-ASIC dynamic power consumption ratio is approximately 14, and that the average ASIC chip area is 18 times smaller than that of an FPGA. This means that an ASIC can deliver much better performance within a given hardware area. ASIC designs have relatively better energy efficiency, mostly between 100 GOPS/W and 10 TOPS/W. They have shown excellent performance in the low-power regime: as Fig. 7 shows, representative designs such as DianNao [55], Eyeriss [56] and Envision [57] deliver around 100 GOPS with only milliwatt-level power consumption. Efficiency can even reach 10 TOPS/W at extremely low voltage. On the other hand, ASICs with larger chip sizes can offer more PEs and more memory bandwidth, which leads to higher throughput; Google's TPU [58], for example, reaches a peak performance of 86 TOPS. From a business perspective, high-volume production of ASICs can also reduce the overall cost. However, the deep-learning algorithms for RTAV have a quite short evolution cycle, usually six to nine months. Moreover, the benchmarks for RTAV are far from perfect, and new tasks appear nearly every year. Since an ASIC's time to market is no less than one year, there is a potential risk of incompatibility between hardware processors and new algorithms and application scenarios. Solution providers need to make a risk-return analysis.

Recently, some breakthroughs have taken place in the area of near-memory and in-memory computing. 3-D memory can offer an order of magnitude higher bandwidth and several times lower power consumption than 2-D memory. One example is the Hybrid Memory Cube (HMC) proposed by Micron [59], which uses through-silicon vias (TSVs) to stack dynamic random-access memory (DRAM) on top of the logic circuit. With this method, memory bandwidth can be increased by an order of magnitude over 2-D memory, and power consumption can be five times lower. Some designs already combine the CNN accelerator architecture with HMC [60][61]. Another technology is to embed the computation inside memory, such as memristor




© International Telecommunication Union