Author: 謝林鎰鑫 (Yi-Sin Sie Lin)
Title: Lightweight Hardware Architecture Design of a Deep Learning Neural Network Depthwise Separable Convolution Module
Advisor: Chenn-Jung Huang (黃振榮)
Committee members: Liang-Chun Chen (陳亮均), Heng-Ming Chen (陳恆鳴)
Degree: Master's
Institution: National Dong Hwa University
Department: Department of Computer Science and Information Engineering
Student ID: 611021239
Publication year (ROC calendar): 112 (2023)
Graduation academic year: 111
Language: Chinese
Pages: 61
Keywords: Artificial Intelligence, Deep Learning, Neural Network, Convolution, AI Accelerator, Circuit Design
Statistics:
  • Recommendations: 1
  • Views: 46
  • Downloads: 28
  • Bookmarks: 0
In recent years, with the proliferation of mobile devices and advances in computing power, deep learning neural networks have made significant progress in mobile edge applications, but challenges remain. Mobile devices have limited computing power, storage capacity, and transmission bandwidth, which constrains complex models and large-scale data processing. Suitable deep learning models, paired with dedicated hardware designs, therefore offer a way to address these challenges.
This thesis proposes a lightweight hardware architecture for the depthwise separable convolution module of deep learning neural networks. The design achieves its light weight in three respects. First, depthwise separable convolutions replace standard convolutions, reducing the amount of computation. Second, the number of reads and writes to external memory is reduced through more efficient data use and transfer, alleviating the bandwidth bottleneck. Third, the required on-chip memory capacity is reduced, shrinking chip area and cost.
This thesis presents a hardware algorithm design based on depthwise separable convolution that reduces both external memory accesses and the required on-chip memory capacity: each parameter and feature needs to be read from and written to external memory only once, and the on-chip storage requirement is only 291 KB, roughly one-eighth of that of a comparable design for standard convolution. The design is implemented in the hardware description language (HDL) Verilog and verified through software simulation to confirm that it meets its targets.
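The computation savings claimed for replacing standard convolution with depthwise separable convolution can be illustrated with the standard multiply-accumulate (MAC) cost analysis popularized by MobileNets [18]. This sketch is not from the thesis; the layer shape below is an illustrative assumption, not a design point from the proposed hardware.

```python
# MAC counts for standard vs. depthwise separable convolution,
# following the cost model of MobileNets [18].
# d_k: kernel size, m: input channels, n: output channels,
# d_f: output feature-map width/height.

def standard_conv_macs(d_k, m, n, d_f):
    """MACs for a standard convolution layer: d_k^2 * m * n * d_f^2."""
    return d_k * d_k * m * n * d_f * d_f

def depthwise_separable_macs(d_k, m, n, d_f):
    """MACs for a depthwise (per-channel d_k x d_k) plus pointwise (1x1) pair."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise

# Illustrative layer shape (3x3 kernels, as in Section 3.2 of the thesis).
d_k, m, n, d_f = 3, 128, 128, 56
std = standard_conv_macs(d_k, m, n, d_f)
dws = depthwise_separable_macs(d_k, m, n, d_f)
# The exact ratio is 1/n + 1/d_k^2, so with 3x3 kernels and many
# channels the computation approaches one-ninth of the standard cost.
print(f"standard: {std}, separable: {dws}, ratio: {dws / std:.3f}")
```

With many output channels the `1/n` term vanishes and the `1/d_k^2 = 1/9` term dominates, which is the source of the roughly order-of-magnitude reduction that makes this convolution style attractive for lightweight hardware.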
Chapter 1: Introduction 1
 1.1 Research Background 1
 1.2 Research Objectives 3
 1.3 Research Process 5
 1.4 Thesis Organization 6
Chapter 2: Literature Review 7
 2.1 Deep Learning and Computer Vision 7
 2.2 Lightweight Network Architectures and Depthwise Separable Convolution 10
 2.3 Deep Learning Hardware Architectures 12
Chapter 3: Algorithm Analysis 15
 3.1 Replacing Standard Convolution with Depthwise Separable Convolution 15
 3.2 Using 3×3 Convolution Kernels 20
 3.3 Replacing Pooling with Stride-2 Convolution for Downsampling 22
 3.4 Normalization 23
 3.5 Using the ReLU Activation Function 24
 3.6 Using int8 Quantized Data 25
Chapter 4: Hardware Architecture 27
 4.1 External Memory 29
 4.2 On-Chip Memory 30
 4.3 Convolution Unit 32
 4.4 Control Unit 40
Chapter 5: Experimental Results and Comparison 51
 5.1 Experimental Results 51
 5.2 Comparison with Related Work 58
Chapter 6: Conclusion and Future Work 61
 6.1 Conclusion 61
 6.2 Future Work 62
References 63
[1]LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning. Nature, 521(7553), 436-444.
[2]Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90.
[3]Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
[4]He, K., Zhang, X., Ren, S. and Sun, J. (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
[5]Ioffe, S. and Szegedy, C. (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.
[6]Girshick, R., Donahue, J., Darrell, T. and Malik, J. (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524.
[7]Girshick, R. (2015) Fast R-CNN. arXiv preprint arXiv:1504.08083.
[8]Ren, S., He, K., Girshick, R. and Sun, J. (2016) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint arXiv:1506.01497.
[9]He, K., Gkioxari, G., Dollár, P. and Girshick, R. (2018) Mask R-CNN. arXiv preprint arXiv:1703.06870.
[10]Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016) You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640.
[11]Redmon, J. and Farhadi, A. (2016) YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242.
[12]Redmon, J. and Farhadi, A. (2018) YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.
[13]Bochkovskiy, A., Wang, C.Y. and Liao, H.Y.M. (2020) YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934.
[14]Jocher, G. (2020) YOLOv5: End-to-End Object Detection with Efficient Backbone. [GitHub repository]. Retrieved from https://github.com/ultralytics/yolov5
[15]Ge, Z., Liu, S., Wang, F., Li, Z. and Sun, J. (2021) YOLOX: Exceeding YOLO Series in 2021. arXiv preprint arXiv:2107.08430.
[16]Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., Li, Y., Zhang, B., Liang, Y., Zhou, L., Xu, X., Chu, X., Wei, X. and Wei, X. (2022) YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv preprint arXiv:2209.02976.
[17]Wang, C.Y., Bochkovskiy, A. and Liao, H.Y.M. (2022) YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
[18]Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H. (2017) MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[19]Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L.C. (2019) MobileNetV2: Inverted residuals and linear bottlenecks. arXiv preprint arXiv:1801.04381.
[20]Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., Tan, M., ... and Adam, H. (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244.
[21]Zhang, X., Zhou, X., Lin, M. and Sun, J. (2017) ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.
[22]Ma, N., Zhang, X., Zheng, H. T. and Sun, J. (2018) ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164.
[23]Tan, M. and Le, Q.V. (2020) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
[24]Tan, M. and Le, Q.V. (2021) EfficientNetV2: Smaller Models and Faster Training. arXiv preprint arXiv:2104.00298.
[25]Chen, Y.H., Krishna, T., Emer, J.S. and Sze, V. (2017) Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits, 52(1), 127-138.
[26]Ma, Y., Cao, Y., Vrudhula, S. and Seo, J.S. (2018) Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(7), 1354-1367.
[27]Chung-Bin, W., Ching-Shun, W. and Yu-Kuan, H. (2020) Reconfigurable Hardware Architecture Design and Implementation for AI Deep Learning Accelerator. IEEE Global Conference on Consumer Electronics (GCCE), doi:10.1109/GCCE50665.2020.9291854.
 
 
 
 