Abstract
In recent years, driven by advances in computing power and the massive data generated by the Internet, deep learning (DL) has developed rapidly. Its most prominent model, the convolutional neural network (CNN), has been successfully commercialized in image recognition, object detection, natural language processing, and other fields. However, as networks grow deeper, their demands on computation and memory rise sharply, and how to accelerate CNNs and deploy them on hardware accelerators has become an active research topic. Starting from the advantages of field-programmable gate arrays (FPGAs) for implementing neural networks, this paper introduces the main FPGA development approaches, discusses in detail the optimization strategies for deploying and accelerating CNNs, and surveys the performance of FPGA-based CNN accelerators that adopt these strategies. Finally, future directions for FPGA CNN accelerators are outlined.
The rapid progress of deep learning (DL), and of convolutional neural networks (CNNs) in particular, has brought marked gains in recognition accuracy and enabled applications such as object detection, face recognition, speech recognition, and autonomous driving; these technologies will profoundly change how people live and help drive economic growth. However, as models become deeper and more computationally expensive, and as application scenarios demand real-time response and low power, the problem of neural network acceleration has become increasingly prominent.
A neural network workload comprises two phases, training and inference. Training demands high numerical precision and typically uses floating-point data; deep learning frameworks such as TensorFlow, PyTorch, and Caffe make it easy to exploit the computing power of graphics processing units (GPUs) to accelerate training. Inference, by contrast, is usually subject to real-time and low-power constraints, for which the field-programmable gate array (FPGA) offers an attractive solution: an FPGA is flexibly programmable, dataflow-driven, energy-efficient, and low-power, and any given neural network algorithm can be mapped onto it through hardware optimization, which is an advantage for the rapidly evolving family of CNNs. FPGA-based inference acceleration has therefore become an active research topic in recent years, and this paper focuses on FPGA acceleration of the inference phase.
CNN theory dates back to the 1960s, when David Hubel and Torsten Wiesel, inspired by how visual information is transmitted in the cat brain, proposed the concept of the receptive field.
In recent years, a series of lightweight neural networks has been proposed that greatly reduces computation and parameter counts without sacrificing accuracy, such as SqueezeNet and the MobileNet and ShuffleNet families.
An FPGA is a semi-custom hardware circuit whose look-up tables (LUTs) can be reprogrammed to realize arbitrary logic. An FPGA typically comprises programmable logic cells, programmable I/O, and routing resources; programmable switch matrices are placed at certain points of the programmable interconnect to allow flexible routing. With advances in integration technology, modern FPGAs also include configurable memory blocks and DSP slices (generally built around dedicated multipliers). The basic structure of an FPGA is shown in Fig. 1.

Fig.1 Basic structure of FPGA
Since Xilinx introduced the first commercial FPGA, the XC2064, in 1984, FPGAs have evolved for more than three decades; with continual process improvements, the largest FPGAs now integrate tens of billions of transistors into vast programmable logic fabrics. Given such highly complex hardware, enabling efficient programming for deploying algorithms from different domains has long been a central concern of FPGA vendors. The following subsections introduce several representative FPGA development approaches.
Register-transfer level (RTL) development describes the logic between registers under precise clock control, using a relatively low-level hardware description language (HDL). With an HDL, a hardware engineer can follow a top-down, modular design methodology, assembling a complex digital system layer by layer, much like building with blocks.
RTL development is close to the underlying hardware, and a carefully engineered circuit executes efficiently. However, it requires deep knowledge of both the algorithm structure and the FPGA architecture, and the design process is relatively tedious, which limits the ability of software and algorithm engineers to implement CNN models on FPGAs. Academia and industry therefore urgently need higher-level development modes that improve FPGA usability and development efficiency.
High-level synthesis (HLS) refers to automatically compiling logic described in a high-abstraction programming language such as C, C++, or OpenCL into a circuit model expressed in a lower-abstraction language such as Verilog, VHDL, or SystemVerilog.
HLS hides the low-level logic and implementation details of the FPGA, letting software and algorithm engineers concentrate on the algorithm itself and iterate quickly. Modeling a hardware system in a high-level language such as C or C++ can compress code density to roughly 10%~14% of the RTL equivalent. The HLS flow is illustrated in Fig. 2.

Fig.2 Diagram of high-level synthesis process
Software compilers for CPUs and GPUs have matured over several decades, and frameworks such as Caffe, TensorFlow, and PyTorch make deploying a neural network on them straightforward. Automated mapping tools for FPGAs, however, remain far less mature: one cannot simply invoke a compiler to take a network model all the way to hardware.
Current FPGA-oriented automated tool chains fall into two broad classes. One builds dedicated circuits from intellectual property (IP) cores, as exemplified by Caffeine.
Although all three development approaches above can produce a neural network accelerator, they differ greatly in development efficiency and achievable results; each has its own strengths, weaknesses, and intended users, as the following comparison shows.
| development mode | programming language | advantages | disadvantages | applicable objects |
|---|---|---|---|---|
| RTL | HDL | close to the bottom layer; fine-grained modeling; high circuit execution efficiency | difficult to develop; long design cycle | hardware engineer |
| HLS | C, C++, System-C, OpenCL | high-level programming languages shield hardware details; short design cycle | requires understanding of the basic FPGA structure and design constraints; large implementation granularity | software engineer |
| automated tool chain | C++, Python, etc. | convenient, efficient end-to-end design flow, favored by large companies and developing rapidly | support for boards and network models is not yet comprehensive or mature | AI algorithm engineer |
Accelerator design is, at heart, the problem of deploying and optimizing a domain-specific algorithm on hardware. FPGA acceleration is usually achieved by simplifying the algorithmic model and optimizing the hardware structure so that the algorithm runs efficiently on the platform. This section therefore covers three aspects: network model optimization, convolution algorithm optimization, and hardware implementation optimization.
As networks have deepened in recent years, parameter counts and computational loads have grown enormously, limiting where neural networks can be applied. Compressing the network model has become a prerequisite for real-world deployment. Model compression reduces the computation and storage burden on the hardware, and the robustness of neural network algorithms means the accuracy loss from moderate compression is usually negligible. By the nature of the compression and acceleration methods, network model optimization falls into three categories: quantization, pruning, and lightweight networks.
Quantization is a compression technique that reduces the bit width of the data in a neural network model. During training, single-precision floating point is usually used in pursuit of accuracy; inference, however, typically demands low latency and low power, so lower-bit-width representations can be adopted to shrink storage requirements and the energy spent reading and writing data, while keeping the drop in network accuracy small.
Compared with CPUs and GPUs, which operate on fixed bit widths (aligned to 8-bit boundaries), a notable advantage of FPGAs is that computation can use arbitrary bit widths, making the choice of data width far more flexible.
Quantization is one of the most common optimization strategies for hardware deployment. Mainstream neural network frameworks such as PyTorch and TensorFlow now ship mature quantization schemes, so quantizing a model is straightforward. The benefits include higher computational throughput, lower transfer bandwidth, and a smaller memory footprint; the main difficulty lies in realizing, on the hardware, the performance gains that the algorithm promises.
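As a concrete illustration (a minimal sketch, not taken from any surveyed accelerator), the following code shows symmetric uniform quantization of a weight tensor to int8; the function names and the single per-tensor scale are illustrative assumptions, since real toolflows often use per-channel scales.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric uniform quantization of a float tensor to int8."""
    scale = np.max(np.abs(x)) / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# round-to-nearest bounds the per-weight error by half a quantization step
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

In a real deployment the scale factor would be folded into the accelerator's requantization step after each layer's accumulation.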
CNNs are redundant: even if pruning sets the weights of unimportant connections to zero, the model can still express its features. The choice of pruning threshold is the key design decision: over-pruning harms the model's expressiveness, while well-chosen pruning costs little or no accuracy, so pruning must balance compressing the model against preserving accuracy. A model before and after pruning is shown in Fig. 3.

Fig.3 Schematic diagram of network pruning
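The pruning idea in Fig. 3 can be sketched as simple magnitude pruning: weights whose absolute value falls below a threshold are zeroed, with the threshold chosen from a target sparsity. The 80% sparsity here is an arbitrary illustrative choice, not a value from any cited work.

```python
import numpy as np

# Magnitude pruning: zero out the smallest-magnitude weights so that a
# target fraction (the sparsity) of the weight matrix is removed.
w = np.random.randn(256, 256)
sparsity = 0.8
threshold = np.quantile(np.abs(w), sparsity)

mask = np.abs(w) >= threshold   # keep only the largest 20% of weights
w_pruned = w * mask

# the surviving fraction matches the target (up to quantile granularity)
assert abs(mask.mean() - (1 - sparsity)) < 0.01
```

In practice the network is then fine-tuned with the mask fixed, so the remaining weights can recover most of the lost accuracy.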
In recent years, most CNN models have adopted the rectified linear unit (ReLU) as the activation function. Research has shown that ReLU drives a large fraction of activations to zero, and these zero activations can likewise be exploited for pruning.
The drawback of fine-grained pruning is the sparsity it introduces: the data paths become irregular, so when the computation is parallelized and pipelined on an FPGA the load across data paths becomes unbalanced. Structured pruning at the layer or channel level reduces this sparsity problem. The two pruning styles are contrasted in Fig. 4.

Fig.4 Comparison of different network pruning methods
Pruning and quantization both reduce the computational complexity of an existing high-performance model, whereas lightweight networks focus on designing efficient network modules, such as global average pooling, depthwise separable convolution, and grouped convolution. These structures have few parameters, little computation, a small memory footprint, and fast execution, retaining high accuracy while compressing the model substantially. Lightweight models ease the computation and bandwidth pressure and speed up inference, and represent one future direction for CNNs. However, to preserve accuracy, lightweight networks keep growing in depth and width, which poses new challenges for hardware design.
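The saving from depthwise separable convolution can be made concrete with a small multiply-count comparison (an illustrative calculation; the layer shape 56×56×128 is an assumed example, not from any cited network).

```python
# Compare multiply counts of a standard convolution layer against a
# depthwise separable one (depthwise k x k + pointwise 1x1), the
# building block used by MobileNet-style networks.
def conv_mults(h, w, c_in, c_out, k):
    return h * w * c_out * c_in * k * k          # standard convolution

def dw_separable_mults(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k             # one k x k filter per channel
    pointwise = h * w * c_in * c_out             # 1x1 conv mixes channels
    return depthwise + pointwise

std = conv_mults(56, 56, 128, 128, 3)
sep = dw_separable_mults(56, 56, 128, 128, 3)
print(std / sep)   # roughly 8.4x fewer multiplications for this shape
```

The theoretical ratio is k²·c_out / (k² + c_out), so the saving approaches k² (here 9) as the output channel count grows.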
Bai et al. designed a CNN accelerator on FPGA built around depthwise separable convolution.
The convolutional layers of a CNN are compute-intensive: in classic networks such as LeNet-5, AlexNet, and VGG16, convolution accounts for more than 90% of the execution time.
The fast Fourier transform (FFT) is a staple of signal processing; with it, convolution in the time domain becomes element-wise multiplication in the frequency domain:
f * g = ℱ⁻¹(ℱ(f) ⊙ ℱ(g)), (1)

where ℱ and ℱ⁻¹ denote the forward and inverse Fourier transforms, * denotes convolution, and ⊙ denotes element-wise multiplication.
Many accelerator designs adopt this frequency-domain formulation of convolution. Ko et al. designed an energy-efficient accelerator that performs CNN training with frequency-domain computation.
The FFT approach moves convolution into the frequency domain, but the benefit is pronounced only for large kernels, whereas the trend in CNNs is toward small 3×3 and 1×1 kernels; for networks built from small kernels, the acceleration efficiency is therefore low.
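The equivalence in Eq. (1) that FFT-based accelerators exploit can be checked numerically in a few lines (a 1-D NumPy sketch, not tied to any cited design):

```python
import numpy as np

# Verify the convolution theorem: convolving in the time domain equals
# element-wise multiplication in the frequency domain.
x = np.random.randn(64)
k = np.random.randn(3)
n = len(x) + len(k) - 1              # full (linear) convolution length

direct = np.convolve(x, k)           # O(N*K) sliding-window convolution
spectral = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, spectral)
```

Zero-padding both signals to length N+K−1 before the transform turns the FFT's circular convolution into the linear convolution a CNN layer needs.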
The Winograd algorithm computes convolution with a minimal number of multiplications: the input tile and the filter are transformed by small constant matrices, multiplied element-wise, and transformed back, which suits the small kernels used in modern CNNs.
Podili et al. presented a fast and efficient FPGA implementation of CNNs based on the Winograd algorithm.
Compared with the FFT, the Winograd algorithm has lower computational complexity, but its transformation cost is higher: as the Winograd tile parameters grow, the values in the constant matrices increase sharply, which degrades numerical accuracy.
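A minimal worked example makes the transform structure concrete: the 1-D Winograd minimal filtering form F(2,3) produces two outputs of a 3-tap filter with 4 multiplications instead of 6, using the standard constant matrices from Lavin and Gray (the sample data d and g are arbitrary).

```python
import numpy as np

# Winograd minimal filtering F(2,3): Y = A^T [ (G g) ⊙ (B^T d) ]
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile of 4 samples
g = np.array([1.0, 1.0, 1.0])        # 3-tap filter

m = (G @ g) * (BT @ d)               # only 4 element-wise multiplications
y = AT @ m                           # y[i] = sum_k d[i+k] * g[k]

assert np.allclose(y, [np.dot(d[0:3], g), np.dot(d[1:4], g)])
```

The filter transform G·g can be precomputed offline, so at run time only the data transform, the 4 multiplications, and the inverse transform remain; 2-D F(2×2, 3×3) nests the same matrices on both sides.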
GPUs and DSPs often implement CNNs as general matrix multiplication (GEMM), accelerated by matrix libraries such as OpenBLAS and cuBLAS and by CUDA tooling, to execute CNN workloads efficiently. On an FPGA, the appeal of GEMM is that convolution becomes an easy-to-implement matrix multiplication that flexibly accommodates different network models and structures, and the accelerator's parallelism can be controlled through configuration.
To raise throughput while keeping the accelerator compatible with different CNN models, several designs map convolution onto GEMM. The process of converting a stride-1 convolution into a GEMM operation is shown in Fig. 5.

Fig.5 Converting convolution operation to GEMM operation
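The conversion in Fig. 5 is commonly implemented as im2col: each receptive field of the input is unfolded into a column, so the whole convolution becomes one matrix multiplication. The sketch below is illustrative (the helper name `im2col` and the 3×8×8 input shape are assumptions, and padding/stride are omitted for brevity).

```python
import numpy as np

# im2col: unfold each k x k receptive field of the input into a column,
# so a stride-1 convolution becomes a single GEMM.
def im2col(x, k):                    # x: (C, H, W), k: kernel size
    c, h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

x = np.random.randn(3, 8, 8)         # 3-channel input feature map
w = np.random.randn(16, 3, 3, 3)     # 16 filters of shape 3x3x3

# (16, 27) @ (27, 36) -> (16, 36), reshaped back to (16, 6, 6)
out = (w.reshape(16, -1) @ im2col(x, 3)).reshape(16, 6, 6)

# cross-check one output element against the direct definition
assert np.isclose(out[0, 2, 3], np.sum(x[:, 2:5, 3:6] * w[0]))
```

The cost of this generality is the extra storage noted in the summary table: overlapping receptive fields are duplicated k² times in the unfolded matrix.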
| item | FFT | FFT | Winograd | Winograd | Winograd | GEMM | GEMM |
|---|---|---|---|---|---|---|---|
| reference | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
| year | 2017 | 2020 | 2017 | 2017 | 2022 | 2020 | 2021 |
| device | Stratix-V QPI | Alveo U200 | Arria 10 GX1150 | ZCU102 | XC7K325T | XCVU9P | XC7Z045 |
| network | VGG | VGG | AlexNet | VGG | VGG | MobileNet V1 | MobileNet V2 |
| datatype | fp32 | fix16 | fp16 | fix16 | fix8 | int8 | flexible |
| frequency/MHz | 200 | 200 | 303 | 200 | 150 | 300 | 100 |
| DSP | 224 | 2680 | 1576 | 2520 | 840 | 5149 | 900 |
| LUT/K | 201 | 230 | 246 | 600 | — | 246 | 193 |
| throughput/GOPS | 123 | 1699 | 1382 | 3045 | 441 | 3651 | 364 |
This subsection discusses the hardware optimization strategies commonly used in FPGA-based CNN accelerator design, including loop optimization, systolic arrays, full on-chip mapping, and multi-layer fusion.
A convolution can be written as six nested loops, with the multiply-accumulate performed in the innermost loop, as shown in Fig. 6.

Fig.6 Convolutional layer code
Hardware designs optimize the convolution loops in three main ways: loop unrolling, loop interchange, and loop tiling. Loop unrolling determines the parallelism of the accelerator; because convolution is dominated by array multiply-adds, the unrolling factor is bounded by on-chip resources such as DSPs and block random access memory (BRAM). Loop interchange changes the order of computation, moving the dimensions to be unrolled into the inner loops, and can also eliminate data dependences. Loop tiling partitions a large feature map into small tiles, solving the problem that limited on-chip resources cannot compute or store a complete feature map; small tiles also let different network layers reuse the same hardware acceleration module.
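The effect of tiling can be sketched in software: the output map is processed in tiles so that only a small block of data must be resident at once, without changing the result. This is an illustrative single-channel model (the 8×8 map and tile size 4 are arbitrary choices), not code from any cited accelerator.

```python
import numpy as np

# Loop tiling sketch: iterate over output tiles, then over the pixels
# inside each tile; in hardware the inner loops would be unrolled.
H, W, K, TILE = 8, 8, 3, 4
x = np.random.randn(H + K - 1, W + K - 1)   # padded input feature map
w = np.random.randn(K, K)                    # single 3x3 kernel

out = np.zeros((H, W))
for th in range(0, H, TILE):                 # loop over output tiles
    for tw in range(0, W, TILE):
        for i in range(th, th + TILE):       # loops inside one tile
            for j in range(tw, tw + TILE):
                out[i, j] = np.sum(x[i:i + K, j:j + K] * w)

# tiled execution matches the untiled definition exactly
ref = np.array([[np.sum(x[i:i + K, j:j + K] * w) for j in range(W)]
                for i in range(H)])
assert np.allclose(out, ref)
```

Choosing the tile sizes and unroll factors that best fit a given FPGA's DSP and BRAM budget is exactly the large design space the summary table refers to.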
One representative design applying these loop optimizations achieved an end-to-end latency of 47.97 ms.
When hardware resource utilization is high, the FPGA's timing constraints become increasingly hard to meet, lowering the accelerator's operating frequency and capping performance. One major cause is that high utilization makes placement and routing difficult, so critical-path delay limits the achievable clock. A systolic array replaces long broadcast data paths with large fan-in and fan-out by deeply pipelined short paths between processing elements (PEs), sustaining a high operating frequency while making full use of on-chip resources. A typical systolic array is shown in Fig. 7.

Fig.7 Diagram of systolic array
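The dataflow of Fig. 7 can be modeled cycle by cycle in a few lines. The sketch below simulates an output-stationary systolic array computing C = A·B: operands enter skewed at the left and top edges, each PE multiply-accumulates and forwards its operands to its right and bottom neighbours. The 4×4 size and output-stationary choice are illustrative assumptions.

```python
import numpy as np

# Cycle-level model of an N x N output-stationary systolic array.
N = 4
A = np.random.randn(N, N)
B = np.random.randn(N, N)

acc = np.zeros((N, N))      # one accumulator held inside each PE
a_reg = np.zeros((N, N))    # operand registers forwarded right
b_reg = np.zeros((N, N))    # operand registers forwarded down

for t in range(3 * N - 2):                  # cycles for the full wavefront
    for i in reversed(range(N)):            # update far PEs first so each
        for j in reversed(range(N)):        # PE reads last cycle's registers
            a = a_reg[i, j - 1] if j > 0 else \
                (A[i, t - i] if 0 <= t - i < N else 0.0)   # skewed left feed
            b = b_reg[i - 1, j] if i > 0 else \
                (B[t - j, j] if 0 <= t - j < N else 0.0)   # skewed top feed
            acc[i, j] += a * b
            a_reg[i, j], b_reg[i, j] = a, b

assert np.allclose(acc, A @ B)
```

Because every wire spans only neighbouring PEs, the physical layout keeps all paths short, which is why the architecture closes timing at high frequencies even at high resource utilization.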
Because FPGA hardware resources are limited, most accelerators use a configurable, general-purpose PE array through which the network layers execute in turn. In this scheme, intermediate feature maps are stored in external memory, adding off-chip access pressure and limiting throughput. Studies have shown that an off-chip DRAM access costs orders of magnitude more energy than an on-chip memory access.
Another design approach maps the neural network entirely on chip, instantiating a dedicated hardware module for every layer so that all layers execute concurrently in a pipeline.
Fully on-chip mapping reduces external memory accesses, but it requires large amounts of on-chip memory for intermediate feature maps and weights; a sizable CNN can easily exhaust the available on-chip storage, so the approach is usually limited to small networks.
In a typical design, the CNN executes on the accelerator layer by layer: the output feature map of one convolutional layer is written to off-chip memory, then read back as the input feature map of the next layer, causing frequent off-chip reads and writes. The idea of multi-layer fusion is to read the first layer's input on chip, compute several convolutional layers together on chip, and send only the final layer's output feature map off chip, thereby eliminating the frequent off-chip traffic of intermediate data in large networks.
Multi-layer fusion can also exploit data reuse between adjacent layers, further reducing off-chip accesses, as illustrated in Fig. 8.

Fig.8 Diagram of layer fusion
Alwani et al. first proposed the fused-layer CNN accelerator, evaluating multiple convolutional layers tile by tile while keeping the intermediate tiles on chip.
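The key bookkeeping in layer fusion is the "pyramid" of tile sizes: to produce one output tile, each fused stride-1 convolution enlarges the required input window by k−1 pixels. A minimal sketch (the helper `input_tile` and the 8×8 tile through three 3×3 layers are illustrative assumptions):

```python
# Input tile edge length needed to produce an out_tile x out_tile output
# after a stack of stride-1 convolutions with the given kernel sizes.
def input_tile(out_tile, kernels):
    size = out_tile
    for k in reversed(kernels):   # walk backwards from output to input
        size += k - 1             # each k x k conv adds a (k-1) halo
    return size

# one 8x8 output tile through three fused 3x3 layers needs 14x14 input
assert input_tile(8, [3, 3, 3]) == 14
```

This halo growth is why fusion depth is a trade-off: fusing more layers removes more off-chip traffic but enlarges the on-chip tile buffers and recomputed overlap.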
Designing an FPGA accelerator involves choices and trade-offs among development approaches, algorithm-level optimization strategies, and hardware implementation strategies. Network model optimization and convolution algorithm optimization are algorithm-level strategies whose purpose is to transform the model and reduce computational complexity, whereas hardware implementation optimization is tied to the concrete hardware structure and must weigh the latency and power of different data paths. The strategies are summarized below.
| type | strategy | method | optimization | limitations |
|---|---|---|---|---|
| network model optimization | quantization | reduce the bit width of weights or activation data | reduce storage requirements | binary quantized networks suffer significant accuracy degradation |
| network model optimization | pruning | set unimportant weights or activations to zero | compress the network model to accelerate computation | introduces sparsity |
| network model optimization | lightweight network | use efficient computation modules to cut the model's computation | reduce the number of parameters and computation | increases the number of network layers |
| convolution algorithm optimization | FFT | convert convolution to frequency-domain multiplication | reduce computational complexity | little benefit for small convolution kernels |
| convolution algorithm optimization | Winograd | convert convolution to the Winograd algorithm | reduce computational complexity | high demand on storage bandwidth |
| convolution algorithm optimization | GEMM | convert convolution to matrix multiplication | easy to implement; increases versatility | occupies additional storage resources |
| hardware implementation optimization | loop optimization | unroll, interchange, and tile the six nested loops | fully utilize hardware resources for parallel computing | large design space requiring modeling and analysis |
| hardware implementation optimization | systolic array | employ deeply pipelined short data paths | ease timing closure and raise the operating frequency | difficult to design |
| hardware implementation optimization | fully on-chip mapping | implement a dedicated hardware module for each network layer | reduce off-chip data movement | only suitable for small network models |
| hardware implementation optimization | multi-layer fusion | complete multi-layer convolution within the FPGA | reduce off-chip data movement; increase data reuse | large design space |
In practice, an accelerator design combines several of these optimization strategies to reach the best overall result.
Current neural network accelerator designs target mainly inference; accelerators for training remain rare. CNN training demands higher numerical precision and involves gradient computation and weight updates over repeated iterations, so its computational complexity is higher. Since training and inference share similar computation patterns, FPGA acceleration of CNN training may become one of the next research hotspots.
Earlier CNN accelerator designs typically performed multiplication on the FPGA's DSP resources, leaving the look-up table (LUT) resources underused. In binary neural networks, multiply-accumulate reduces to XNOR operations, so the plentiful LUTs can replace the scarce DSP multipliers, greatly improving area efficiency. Combining binary networks with the LUT fabric offers a new direction for FPGA neural network acceleration, and raising inference accuracy will be the key question for this line of research.
For large-scale FPGA neural network acceleration, academia has made notable progress in automated deployment, model compression, fast algorithms, parallel computing, and data reuse. FPGA accelerators compare favorably with CPUs and GPUs in throughput and power: the cloud data centers of major Internet companies such as Microsoft and Baidu deploy FPGA accelerators at scale, and FPGAs can also accelerate neural network inference in edge devices.
Accelerating artificial intelligence applications has become one of the strategic directions of FPGA development. Deep neural network algorithms evolve rapidly, and FPGAs hold unique advantages for hardware acceleration; at the same time, FPGAs and their development flows must keep improving to match the constant stream of new algorithms, applications, and requirements if they are to flourish in the AI era.
References
HUBEL D H,WIESEL T N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex[J]. The Journal of Physiology, 1962,160(1):106-154. doi:10.1113/jphysiol.1962.sp006837.
LECUN Y,BOTTOU L,BENGIO Y,et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998,86(11):2278-2324. doi:10.1109/5.726791.
KRIZHEVSKY A,SUTSKEVER I,HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017,60(6):84-90. doi:10.1145/3065386.
SIMONYAN K,ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04)[2022-04-14]. https://arxiv.org/abs/1409.1556.
HE Kaiming,ZHANG Xiangyu,REN Shaoqing,et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Las Vegas:IEEE, 2016:770-778. doi:10.1109/CVPR.2016.90.
IANDOLA F N,HAN S,MOSKEWICZ M W,et al. SqueezeNet:AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size[EB/OL]. (2016-02-24)[2022-04-14]. https://arxiv.org/abs/1602.07360.
HOWARD A G,ZHU M L,CHEN B,et al. MobileNets:efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017-04-17)[2022-04-14]. https://arxiv.org/abs/1704.04861.
SANDLER M,HOWARD A,ZHU M L,et al. MobileNetV2:inverted residuals and linear bottlenecks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City:IEEE, 2018:4510-4520. doi:10.1109/CVPR.2018.00474.
HOWARD A,SANDLER M,CHEN B,et al. Searching for MobileNetV3[C]// 2019 IEEE/CVF International Conference on Computer Vision(ICCV). Seoul:IEEE, 2019:1314-1324. doi:10.1109/ICCV.2019.00140.
ZHANG Xiangyu,ZHOU Xinyu,LIN Mengxiao,et al. ShuffleNet:an extremely efficient convolutional neural network for mobile devices[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City:IEEE, 2018:6848-6856. doi:10.1109/CVPR.2018.00716.
MA Ningning,ZHANG Xiangyu,ZHENG Haitao,et al. ShuffleNet V2:practical guidelines for efficient CNN architecture design[C]// Computer Vision-ECCV 2018. Cham:Springer International Publishing, 2018:122-138. doi:10.1007/978-3-030-01264-9_8.
CONG J,LIU B,NEUENDORFFER S,et al. High-level synthesis for FPGAs:from prototyping to deployment[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2011,30(4):473-491. doi:10.1109/TCAD.2011.2110592.
ZHANG Chen,FANG Zhenman,ZHOU Peipei,et al. Caffeine:towards uniformed representation and acceleration for deep convolutional neural networks[C]// 2016 IEEE/ACM International Conference on Computer-Aided Design(ICCAD). Austin:IEEE, 2016:1-8. doi:10.1145/2966986.2967011.
VENIERIS S I,BOUGANIS C S. fpgaConvNet:a framework for mapping convolutional neural networks on FPGAs[C]// 2016 IEEE the 24th Annual International Symposium on Field-Programmable Custom Computing Machines(FCCM). Washington:IEEE, 2016:40-47. doi:10.1109/FCCM.2016.22.
UMUROGLU Y,FRASER N J,GAMBARDELLA G,et al. FINN:a framework for fast,scalable binarized neural network inference[C]// Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2017:65-74. doi:10.1145/3020078.3021744.
DUARTE J,HAN S,HARRIS P,et al. Fast inference of deep neural networks in FPGAs for particle physics[J]. Journal of Instrumentation, 2018(13):P07027. doi:10.1088/1748-0221/13/07/P07027.
PLAGWITZ P,HANNIG F,STRÖBEL M,et al. A safari through FPGA-based neural network compilation and design automation flows[C]// 2021 IEEE the 29th Annual International Symposium on Field-Programmable Custom Computing Machines(FCCM). Orlando:IEEE, 2021:10-19. doi:10.1109/FCCM51124.2021.00010.
AMD. Vitis AI[EB/OL]. [2022-04-14]. https://china.xilinx.com/products/design-tools/vitis/vitis-ai.html.
MOREAU T,CHEN Tianqi,VEGA L,et al. VTA:an open hardware-software stack for deep learning[EB/OL]. (2018-07-11)[2022-04-14]. https://arxiv.org/abs/1807.04188v2.
MathWorks. Deep learning processor IP core[EB/OL]. [2022-04-14]. https://www.mathworks.cn/help/deep-learning-hdl/ug/deep-learning-processor-ip-core.html.
SUDA N,CHANDRA V,DASIKA G,et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks[C]// Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2016:16-25. doi:10.1145/2847263.2847276.
SUN Xiaojian,LIN Ruiquan,FANG Ziqing,et al. Low-power MobileNetV2 network identification system based on FPGA acceleration[J]. Computer Measurement & Control, 2023,31(5):221-227,234. doi:10.16526/j.cnki.11-4762/tp.2023.05.033. (in Chinese)
LIANG Shuang,YIN Shouyi,LIU Leibo,et al. FP-BNN:binarized neural network on FPGA[J]. Neurocomputing, 2018(275):1072-1086. doi:10.1016/j.neucom.2017.09.046.
HAN S,MAO H Z,DALLY W J. Deep compression:compressing deep neural networks with pruning,trained quantization and Huffman coding[EB/OL]. (2015-10-01)[2022-04-14]. https://arxiv.org/abs/1510.00149.
ALBERICIO J,JUDD P,HETHERINGTON T,et al. Cnvlutin:ineffectual-neuron-free deep neural network computing[J]. ACM SIGARCH Computer Architecture News, 2016,44(3):1-13. doi:10.1145/3007787.3001138.
NURVITADHI E,VENKATESH G,SIM J,et al. Can FPGAs beat GPUs in accelerating next-generation deep neural networks[C]// Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2017:5-14. doi:10.1145/3020078.3021740.
VÉSTIAS M. Efficient design of pruned convolutional neural networks on FPGA[J]. Journal of Signal Processing Systems, 2021,93(5):531-544. doi:10.1007/s11265-020-01606-2.
FAN Yingbo,PANG Wei,LU Shengli. HFPQ:deep neural network compression by hardware-friendly pruning-quantization[J]. Applied Intelligence, 2021,51(10):7016-7028. doi:10.1007/s10489-020-01968-x.
BAI Lin,ZHAO Yiming,HUANG Xinming. A CNN accelerator on FPGA using depthwise separable convolution[J]. IEEE Transactions on Circuits and Systems II―Express Briefs, 2018,65(10):1415-1419. doi:10.1109/TCSII.2018.2865896.
HUANG Qijing,WANG Dequan,DONG Zhen,et al. CoDeNet:efficient deployment of input-adaptive object detection on embedded FPGAs[C]// The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Virtual Event:Association for Computing Machinery, 2021:206-216. doi:10.1145/3431920.3439295.
JIANG Can. An acceleration structure of convolutional neural network[D]. Chengdu,China:University of Electronic Science and Technology of China, 2020. (in Chinese)
LIU Zhiqiang. Research on FPGA-based accelerator design for convolutional neural networks[D]. Changsha,China:National University of Defense Technology, 2019. doi:10.27052/d.cnki.gzjgu.2019.000033. (in Chinese)
KO J H,MUDASSAR B,NA T,et al. Design of an energy-efficient accelerator for training of convolutional neural networks using frequency-domain computation[C]// 2017 the 54th ACM/EDAC/IEEE Design Automation Conference(DAC). Austin:IEEE, 2017:1-6. doi:10.1145/3061639.3062228.
ZHANG Chi,PRASANNA V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system[C]// Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2017:35-44. doi:10.1145/3020078.3021727.
ZENG Hanqing,CHEN Ren,ZHANG Chi,et al. A framework for generating high throughput CNN implementations on FPGAs[C]// Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2018:117-126. doi:10.1145/3174243.3174265.
WINOGRAD S. Arithmetic complexity of computations[M]. Philadelphia:Society for Industrial and Applied Mathematics, 1980.
PODILI A,ZHANG Chi,PRASANNA V. Fast and efficient implementation of convolutional neural networks on FPGA[C]// 2017 IEEE the 28th International Conference on Application-specific Systems,Architectures and Processors(ASAP). Seattle:IEEE, 2017:11-18. doi:10.1109/ASAP.2017.7995253.
SHEN Junzhong,HUANG You,WANG Zelong,et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA[C]// Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2018:97-106. doi:10.1145/3174243.3174257.
AYDONAT U,O'CONNELL S,CAPALIJA D,et al. An OpenCL™ deep learning accelerator on Arria 10[C]// Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2017:55-64. doi:10.1145/3020078.3021738.
LU Liqiang,LIANG Yun,XIAO Qingcheng,et al. Evaluating fast algorithms for convolutional neural networks on FPGAs[C]// 2017 IEEE the 25th Annual International Symposium on Field-Programmable Custom Computing Machines(FCCM). Napa:IEEE, 2017:101-108. doi:10.1109/FCCM.2017.64.
LU Liqiang,ZHENG Size,XIAO Qingcheng,et al. Accelerating convolutional neural networks on FPGAs[J]. Science China Information Sciences, 2019,49(3):277-294. (in Chinese)
GONG Yu,XU Zhihan,HE Zhezhi,et al. N3H-Core:neuron-designed neural network accelerator via FPGA-based heterogeneous computing cores[C]// Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Virtual Event:Association for Computing Machinery, 2022:112-122. doi:10.1145/3490422.3502367.
ZHANG Wentai,JIANG Ming,LUO Guojie. Evaluating low-memory GEMMs for convolutional neural network inference on FPGAs[C]// 2020 IEEE the 28th Annual International Symposium on Field-Programmable Custom Computing Machines(FCCM). Fayetteville:IEEE, 2020:28-32. doi:10.1109/FCCM48280.2020.00013.
NIU Yue,KANNAN R,SRIVASTAVA A,et al. Reuse kernels or activations? a flexible dataflow for low-latency spectral CNN acceleration[C]// Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Seaside:Association for Computing Machinery, 2020:266-276. doi:10.1145/3373087.3375302.
HUANG Chengcheng,DONG Xiaoxiao,LI Zhao,et al. Efficient stride 2 Winograd convolution method using unified transformation matrices on FPGA[C]// 2021 International Conference on Field-Programmable Technology(ICFPT). Auckland:IEEE, 2021:1-9. doi:10.1109/ICFPT52863.2021.9609907.
WU Yanxia,LIANG Kai,LIU Ying,et al. The progress and trends of FPGA-based accelerators in deep learning[J]. Chinese Journal of Computers, 2019,42(11):2461-2480. doi:10.11897/SP.J.1016.2019.02461. (in Chinese)
QIN Huabiao,CAO Qinping. Design of convolutional neural networks hardware acceleration based on FPGA[J]. Journal of Electronics & Information Technology, 2019,41(11):2599-2605. doi:10.11999/JEIT190058. (in Chinese)
MA Yufei,CAO Yu,VRUDHULA S,et al. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks[C]// Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey:Association for Computing Machinery, 2017:45-54. doi:10.1145/3020078.3021736.
MA Y F,CAO Y,VRUDHULA S,et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA[J]. IEEE Transactions on Very Large Scale Integration(VLSI) Systems, 2018,26(7):1354-1367. doi:10.1109/TVLSI.2018.2815603.
WEI X C,YU C H,ZHANG P,et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs[C]// Proceedings of the 54th Annual Design Automation Conference 2017. Austin:Association for Computing Machinery, 2017:29. doi:10.1145/3061639.3062207.
LIU L Q,BROWN S. Leveraging fine-grained structured sparsity for CNN inference on systolic array architectures[C]// 2021 the 31st International Conference on Field-Programmable Logic and Applications(FPL). Dresden:IEEE, 2021:301-305. doi:10.1109/FPL53798.2021.00060.
HOROWITZ M. 1.1 Computing's energy problem(and what we can do about it)[C]// 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers(ISSCC). San Francisco:IEEE, 2014:10-14. doi:10.1109/ISSCC.2014.6757323.
LIU Zhiqiang,DOU Yong,JIANG Jingfei,et al. Throughput-optimized FPGA accelerator for deep convolutional neural networks[J]. ACM Transactions on Reconfigurable Technology and Systems, 2017,10(3):17. doi:10.1145/3079758.
NGUYEN D T,NGUYEN T N,KIM H,et al. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection[J]. IEEE Transactions on Very Large Scale Integration(VLSI) Systems, 2019,27(8):1861-1873. doi:10.1109/TVLSI.2019.2905242.
LI Huimin,FAN Xitian,JIAO Li,et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]// 2016 the 26th International Conference on Field Programmable Logic and Applications(FPL). Lausanne:IEEE, 2016:1-9. doi:10.1109/FPL.2016.7577308.
HALL M,BETZ V. HPIPE:heterogeneous layer-pipelined and sparse-aware CNN inference for FPGAs[C]// The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays(FPGA'20). New York:Association for Computing Machinery, 2020:320. doi:10.1145/3373087.3375380.
ALWANI M,CHEN H,FERDMAN M,et al. Fused-layer CNN accelerators[C]// 2016 the 49th Annual IEEE/ACM International Symposium on Microarchitecture(MICRO). Taipei,Taiwan,China:IEEE, 2016:1-12. doi:10.1109/MICRO.2016.7783725.
XIAO Qingcheng,LIANG Yun,LU Liqiang,et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]// 2017 the 54th ACM/EDAC/IEEE Design Automation Conference(DAC). Austin:IEEE, 2017:1-6. doi:10.1145/3061639.3062244.
MENG J,VENKATARAMANAIAH S K,ZHOU C T,et al. FixyFPGA:efficient FPGA accelerator for deep neural networks with high element-wise sparsity and without external memory access[C]// 2021 the 31st International Conference on Field-Programmable Logic and Applications(FPL). Dresden:IEEE, 2021:9-16. doi:10.1109/FPL53798.2021.00010.