
Detailed Record

Author: 賴旻瑄
Author (English): Min-Hsuan Lai
Title: 基於卷積架構之線性Transformer研究
Title (English): A Linear Transformer based on Convolutional Structures
Advisor: 張意政
Advisor (English): I-Cheng Chang
Committee Members: 施皇嘉, 方文杰
Committee Members (English): Huang-Chia Shih, Wen-Chieh Fang
Degree: Master's
Institution: National Dong Hwa University
Department: Computer Science and Information Engineering
Student ID: 610821221
Publication Year (ROC): 112 (2023)
Graduation Academic Year (ROC): 111
Language: English
Pages: 86
Keywords: Deep Learning, Transformer, Attention Mechanism
In the field of agricultural technology, plant pest and disease recognition is a crucial issue. Combining pest and disease recognition with automatic monitoring devices makes it possible to observe the growth status of plants quickly and accurately and to provide feedback to users so that they can take appropriate measures, saving significant labor costs. In current research on plant pest and disease recognition, many papers apply state-of-the-art convolutional neural networks (CNNs) to identify plant diseases. In other image classification tasks, it has been demonstrated that the Transformer architecture, originally developed for natural language processing, can also be applied to image classification and can outperform CNNs. However, because of the computational complexity of the Transformer architecture itself, its practical application is relatively challenging, and CNNs remain the mainstream models used in practice.

This thesis proposes a hybrid model architecture based on the Transformer and CNN, combining the high accuracy of the Transformer with convolutional structures that reduce the overall computational complexity, so that the model can be applied to practical tasks. The proposed model, named Convolutional Vision Fastformer (CvF), is based on a linear Transformer and a CNN-based token embedding. The Transformer architecture can be divided into two stages: first, the input is encoded by the token embedding into the tokens Q, K, and V; these tokens are then fed into the attention module for interactive computation. In this study, we replace the linear projection used in traditional token embedding with convolutional operations and introduce residual connections into the convolutional process to avoid losing information during repeated convolutions. In addition, we propose a new soft-attention architecture called the ChannelFusion block, applied in the token embedding stage to weight the importance of the information in different channels. In the token-embedding experiments, we compare the effectiveness of our proposed method with soft-attention structures proposed in other papers.
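To make the token-embedding stage concrete, the following is a minimal PyTorch sketch of a convolutional token embedding with a residual shortcut and a channel-wise soft-attention gate. The `ChannelGate` here approximates the described ChannelFusion idea with a squeeze-and-excitation-style gate; all class names, kernel sizes, strides, and dimensions are illustrative assumptions, not the thesis's exact implementation.

```python
# Minimal sketch: convolutional token embedding with a residual path and a
# channel soft-attention gate (an SE-style stand-in for the ChannelFusion block).
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Soft attention over channels: re-weights each channel of the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze spatial dimensions
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                             # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))              # scale channels by their weights

class ConvTokenEmbedding(nn.Module):
    """Convolutional token embedding with a residual shortcut, replacing a linear projection."""
    def __init__(self, in_ch, embed_dim, stride=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=stride, padding=1)
        self.gate = ChannelGate(embed_dim)
        # shortcut matched to the new resolution and width so the residual add is valid
        self.shortcut = nn.Conv2d(in_ch, embed_dim, kernel_size=1, stride=stride)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                               # x: (B, C, H, W)
        y = self.gate(self.proj(x)) + self.shortcut(x)  # residual keeps earlier information
        B, D, H, W = y.shape
        tokens = y.flatten(2).transpose(1, 2)            # (B, H*W, D) token sequence
        return self.norm(tokens), (H, W)

tokens, (h, w) = ConvTokenEmbedding(3, 64)(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 256, 64])
```

In this sketch the residual shortcut carries the input past the gated convolution, which is one simple way to keep low-level information from being washed out by repeated convolutions.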
For the attention stage, this thesis adopts a linear Transformer approach to reduce the computational complexity of the original Transformer. Experimental results show that, compared with other Transformer architectures, CvF requires fewer floating-point operations (FLOPs) and achieves higher accuracy. Taking similar models as an example, the proposed CvF-13 improves accuracy over CvT-13 by 1.4% to 9% on the CIFAR-10 and CIFAR-100 experiments while reducing FLOPs by approximately 8%. Although our method increases the number of parameters, both CvF-13 and CvF-21 add fewer than 1M parameters compared with CvT models of the same number of layers, so the models remain in the same order of magnitude.
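To illustrate why this kind of additive (Fastformer-style) attention scales linearly with the number of tokens, here is a minimal single-head sketch; the layer names, scaling factor, and output projection are assumptions for illustration rather than the thesis's exact code.

```python
# Minimal single-head sketch of Fastformer-style additive attention: queries and
# keys are summarized into global vectors, so cost is O(N*d) instead of O(N^2*d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1)       # scores each query token
        self.w_k = nn.Linear(dim, 1)       # scores each query-modulated key token
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                   # x: (B, N, D)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # global query: attention-weighted sum over all query tokens -> (B, 1, D)
        alpha = F.softmax(self.w_q(q) * self.scale, dim=1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)
        # modulate keys by the global query, then summarize them into a global key
        p = global_q * k
        beta = F.softmax(self.w_k(p) * self.scale, dim=1)
        global_k = (beta * p).sum(dim=1, keepdim=True)
        # element-wise interaction with values, output projection, residual to queries
        return self.out(global_k * v) + q

x = torch.randn(2, 256, 64)                 # 256 tokens of width 64
print(AdditiveAttention(64)(x).shape)       # torch.Size([2, 256, 64])
```

Because every token interacts only with the two global summary vectors, there is no N-by-N attention matrix, which is the source of the FLOP reduction relative to standard self-attention.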
Approval Certificate i
Chinese Abstract viii
Abstract ix
Acknowledgements xii
Contents xiii
List of Figures xvi
List of Tables xviii
List of Equations xx
Chapter 1 Introduction 1
Chapter 2 Related work 5
2.1 Image Classification 5
2.2 Lightweight Convolution 7
2.3 Attention Mechanism In CNN 8
2.4 Self-attention Model 10
2.5 Augmentation 12
Chapter 3 Approach 15
3.1 ChannelFusion Block 18
3.2 Residual Separable Convolution 20
3.3 Convolution Fastformer Block 22
3.4 Network Design 27
3.5 Computational Techniques for Model Training 29
Chapter 4 Experimental Results 31
4.1 Introduction to Datasets 31
4.2 Comparison of Augmentation Methods 38
4.3 Comparison of Soft Attention Methods 39
4.3.1 Comparison of soft attention 39
4.3.2 Soft attention with residual structure 44
4.4 Comparison of Activation Functions 46
4.5 Comparison of Activation Function in Fastformer 47
4.6 Comparison Performance with CvT 49
4.7 Comparison of SOTA Model 51
4.8 Comparison of Parameters and FLOPs 56
Chapter 5 Conclusion 59
References 62

(Full text available for external access after 2025/06/05)
01.pdf
 
 
 
 