Page 35 - First special issue on The impact of Artificial Intelligence on communication networks and services
ITU JOURNAL: ICT Discoveries, Vol. …, March …
Figure 7. Hardware designs of CNN accelerators on different platforms and development route for RTAV accelerator in
ADAS.
(Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/)
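The platforms in Fig. 7 are compared by throughput and by energy efficiency in GOPS/W. As a rough illustration of how the two relate, the sketch below divides the peak throughput by the power figure quoted in the text for the two GPUs. The function name `gops_per_watt` and the `platforms` table are illustrative assumptions, not part of the cited work, and the results are peak-based estimates rather than measured efficiencies.

```python
# Minimal sketch: energy efficiency (GOPS/W) from peak throughput and power.
# The input figures are the ones quoted in the text; the helper name and the
# table layout are hypothetical and chosen only for illustration.

def gops_per_watt(peak_tops: float, power_w: float) -> float:
    """Convert peak throughput in TOPS and power in W to GOPS/W."""
    return peak_tops * 1000.0 / power_w


# (peak throughput in TOPS, power in W), as quoted in the text
platforms = {
    "NVIDIA Tesla V100": (120.0, 300.0),
    "NVIDIA Jetson TX1": (1.0, 10.0),
}

efficiency = {
    name: gops_per_watt(tops, watts)
    for name, (tops, watts) in platforms.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.0f} GOPS/W")
```

With these inputs the sketch gives 400 GOPS/W for the Tesla V100 and 100 GOPS/W for the Jetson TX1 at peak. The TX1 value sits at the upper end of the 20-100 GOPS/W range quoted for 28 nm GPUs. Sustained efficiency on a real network such as VGG-16 would be lower, since the achieved throughput is below the peak.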
[51] shows the comparison between neural network accelerators, as depicted in Fig. 7. We can see from the figure that GPUs are in the top tier of computing speed, but their power consumption is also very high. The recently released NVIDIA Tesla V100 reaches a computing speed of 120 TOPS [52] with a power consumption of 300 W. This is useful in datacenter scenarios such as model training, where power is not the main concern. There are also GPUs designed for low-power embedded environments, such as the NVIDIA Jetson TX1 mobile GPU, which delivers 300 GOPS on VGG-16 and a peak performance of 1 TOPS at a cost of only 10 W [53]. The large general-purpose stream processors on chip may bring considerable parallelism, but their efficiency remains in question. With 28 nm technology, the NVIDIA TITAN-X and TX1 GPUs reach an efficiency of only 20-100 GOPS/W.

To improve efficiency, we need to customize the internal logic of the processing elements (PEs) to enhance processing parallelism and optimize memory access patterns. An FPGA could be a suitable initial choice, since it provides a large amount of computing and memory resources, with enough reconfigurability through programmable interconnection to map common algorithms onto. In Fig. 7 we can see that there have been many FPGA designs. The top designs, including our Aristotle system on the Xilinx ZU9 FPGA platform, reach a throughput of around 2 TOPS, which is quite close to NVIDIA TITAN-X GPUs of the same technology generation, but with almost 10 times better efficiency. This shows that FPGAs are strong competitors.

As we can see, most CNN layers consist of MAC operations and have similar computing patterns, which can potentially be generalized and parameterized. Therefore, with a mature hardware architecture and processing flow, it is feasible to harden the original FPGA accelerator design into an ASIC chip with a programmable interface for reconfigurability, which can further improve performance. Kuon et al. [54] measured the performance gap between FPGAs and ASICs. They report that the critical-path delay of an ASIC is three to four times less than that of an FPGA, that the dynamic power consumption ratio of FPGA to ASIC is approximately 14, and that the average chip area of an ASIC is 18 times smaller than that of an FPGA. This means we can realize much better performance with an ASIC within a given hardware area. ASIC designs have relatively better energy efficiency, mostly between 100 GOPS/W and 10 TOPS/W. They have shown excellent performance in the low-power domain: as we can see from Fig. 7, representative designs such as DianNao [55], Eyeriss [56] and Envision [57] deliver around 100 GOPS with only milliwatt-level power consumption. The efficiency can even reach 10 TOPS/W at extremely low voltage. On the other hand, ASICs with larger chip sizes can offer more PEs and more memory bandwidth, which leads to higher throughput; Google's TPU [58], for example, reaches a peak performance of 86 TOPS. From a business perspective, large-volume production of ASICs could also reduce the overall cost. However, note that deep-learning algorithms for RTAV have quite a short evolution cycle, usually six to nine months. Moreover, the benchmarks for RTAV are still far from perfect, and new tasks appear nearly every year. Since an ASIC's time to market is no less than one year, there is a potential risk of incompatibility between hardware processors and new algorithms and application scenarios. Solution providers need to make a risk-return analysis.

Recently, some breakthroughs have taken place in the area of near-memory and in-memory computing. 3-D memory can offer an order of magnitude higher bandwidth and several times lower power consumption than 2-D memory. One example is the Hybrid Memory Cube (HMC) proposed by Micron [59], which uses through-silicon vias (TSVs) to stack dynamic random-access memory (DRAM) on top of the logic circuit. Through this method, the memory bandwidth can be increased by an order of magnitude over 2-D memory, and the power consumption can be five times lower. There are already some designs combining the CNN accelerator architecture with HMC [60][61]. Another technology is to embed the computation inside memory, such as memristor
International Telecommunication Union