Author (English): Min-Hsuan Lai
Thesis title: 基於卷積架構之線性Transformer研究
Thesis title (English): A Linear Transformer based on Convolutional Structures
Advisor (English): I-Cheng Chang
Committee members (English): Huang-Chia Shih
Wen-Chieh Fang
Keywords (English): Deep Learning, Transformer, Attention mechanism
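This thesis proposes a classification model based on a linear Transformer and CNN token embedding, named Convolutional Vision Fastformer (CvF). The Transformer architecture can be divided into two stages: token embedding first encodes the input into the tokens Q, K, and V, and these three tokens are then fed into attention, where they interact. In this study, we replace the linear projection used in conventional token embedding with a convolutional structure and add a residual structure to the convolution to avoid losing information over repeated convolutions. We also propose a new soft attention architecture, the ChannelFusion block, which is applied within token embedding to weigh the importance of information in different channels. In the token embedding experiments, we compare the proposed method against soft attention structures from other papers in terms of their effect on token embedding. For attention, this thesis adopts a linear Transformer, which reduces the computational complexity of the original Transformer. Experimental results show that, compared with other Transformer architectures, CvF requires fewer floating-point operations and achieves higher accuracy. Taking comparable models as an example, CvF-13 improves accuracy over CvT-13 by 1.4% to 9% on CIFAR-10 and CIFAR-100 while reducing floating-point operations by about 8%. Although our method increases the parameter count, both CvF-13 and CvF-21 add fewer than 1M parameters relative to the CvT model with the same number of layers, so the two remain in the same order of magnitude.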
In agricultural technology, plant pest recognition is a crucial problem. Combining pest recognition with automatic monitoring devices makes it possible to observe plant growth quickly and accurately and to give users feedback so that they can take appropriate measures, which saves significant labor costs. Much current research on plant pest recognition applies state-of-the-art convolutional neural networks (CNNs) to identify plant diseases. Other image classification tasks have shown that the Transformer, originally developed for natural language processing, can also be applied to image classification and can outperform CNNs. However, the computational complexity of the Transformer architecture makes it difficult to deploy, so CNNs remain the mainstream model in practice.

This thesis proposes a hybrid architecture that combines the high accuracy of the Transformer with convolutional structures that reduce the overall computational complexity, so that the model can be applied to practical tasks. The proposed model, named Convolutional Vision Fastformer (CvF), is based on a linear Transformer and CNN token embedding. The Transformer architecture can be divided into two stages: first, token embedding encodes the input into the tokens Q, K, and V; these are then fed into attention, where they interact. In this study, we replace the linear projection used in conventional token embedding with convolutional operations and introduce residual structures into the convolution to avoid information loss over repeated convolutions. We also propose a new soft attention architecture, the ChannelFusion block, which is applied within token embedding to weigh the importance of information in different channels. In the token embedding experiments, we compare the proposed method with soft attention structures proposed in other papers.
For attention, this thesis adopts a linear Transformer to reduce the computational complexity of the original Transformer. Experimental results show that, compared with other Transformer architectures, CvF requires fewer floating-point operations and achieves better accuracy. Taking comparable models as an example, the proposed CvF-13 improves accuracy over CvT-13 by 1.4% to 9% on CIFAR-10 and CIFAR-100 while reducing floating-point operations by approximately 8%. Although our method increases the number of parameters, both CvF-13 and CvF-21 add fewer than 1 million parameters relative to the CvT models with the same number of layers, so the two model families remain in the same order of magnitude.
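The abstract does not spell out the linear attention it uses. As an illustration only, the following is a minimal single-head sketch of additive attention in the style of Fastformer, which the model name suggests as the starting point; the actual CvF block may differ. The vectors `w_q` and `w_k` are hypothetical learned parameters, and the output linear layer is omitted. Every step touches each of the N tokens once, so the cost grows linearly in N rather than quadratically as in standard self-attention.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def additive_pool(tokens, w):
    """Pool N token vectors into one global vector using scores w . token."""
    d = len(w)
    weights = softmax([dot(t, w) / math.sqrt(d) for t in tokens])
    return [sum(a * t[j] for a, t in zip(weights, tokens)) for j in range(d)]


def additive_attention(Q, K, V, w_q, w_k):
    """Fastformer-style additive attention (assumed form, not the thesis code)."""
    # 1. Summarize all queries into a single global query vector.
    global_q = additive_pool(Q, w_q)
    # 2. Mix the global query into every key by element-wise product.
    P = [[g * k for g, k in zip(global_q, row)] for row in K]
    # 3. Summarize the mixed keys into a single global key vector.
    global_k = additive_pool(P, w_k)
    # 4. Mix the global key into every value by element-wise product.
    U = [[g * v for g, v in zip(global_k, row)] for row in V]
    # 5. Residual connection to the queries (output projection omitted).
    return [[u + q for u, q in zip(u_row, q_row)] for u_row, q_row in zip(U, Q)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(additive_attention(tokens, tokens, tokens, [0.0, 0.0], [0.0, 0.0]))
```

No N-by-N attention matrix is ever formed: the tokens interact only through the two pooled vectors, which is where the reduction in floating-point operations comes from.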
Certificate of Approval
Abstract (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
List of Equations
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Image Classification
2.2 Lightweight Convolution
2.3 Attention Mechanism in CNN
2.4 Self-attention Model
2.5 Augmentation
Chapter 3 Approach
3.1 ChannelFusion Block
3.2 Residual Separable Convolution
3.3 Convolution Fastformer Block
3.4 Network Design
3.5 Computational Techniques for Model Training
Chapter 4 Experimental Results
4.1 Introduction to Datasets
4.2 Comparison of Augmentation Methods
4.3 Comparison of Soft Attention Methods
4.3.1 Comparison of soft attention
4.3.2 Soft attention with residual structure
4.4 Comparison of Activation Functions
4.5 Comparison of Activation Functions in Fastformer
4.6 Performance Comparison with CvT
4.7 Comparison with SOTA Models
4.8 Comparison of Parameters and FLOPs
Chapter 5 Conclusion
References

