Author (English): Hung-Yu Yeh
Thesis title (English): An Attention-based Multitask Network for Advanced Driver Assistance System
Advisor (English): I-Cheng Chang
Committee members (English): Yu-Fei Huang, Yuan-Kai Wang
Keywords (English): multi-task learning, attention, convolutional neural network, instance segmentation, semantic segmentation, ADAS, receptive field
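Most computer vision tasks emphasize improving accuracy, but in some real-time applications, such as driver assistance systems for autonomous driving, runtime and memory usage are also important considerations. A multi-task model runs several tasks within the same unit of time, which saves considerable runtime and memory compared with running single-task models one at a time. Most existing multi-task models, however, directly assemble existing components from each single-task domain into one multi-task model, without considering parameter allocation or the purpose of the model. In this thesis we therefore propose a new way to combine multiple single-task models effectively into one multi-task model. We first decide which task is primary and which are secondary, and then design how the model learns and how it shares weights. Building on Mask R-CNN, we propose an attention-based multi-task model that simultaneously solves three important driver assistance tasks: semantic segmentation, instance segmentation, and monocular depth estimation. On the question of how to learn, we propose for each of the two secondary, auxiliary tasks a loss function that converges quickly. The two tasks can thus supply information to the main task in the early stage of training to raise accuracy, while in the later stage, because they converge faster, they do not harm the convergence or accuracy of the main task. This avoids the high cost of manually tuning the balance among the loss functions. On the question of how to share, we propose a module called the Entire Attention Module (EAM), which enriches the semantics of the shared-layer features through a global texture module, a spatial information module, and an effective receptive field enlargement module. Experimental results show that although EAM uses fewer parameters than a single 3×3 convolution layer, it still raises accuracy when combined with FPN, and joint training likewise raises accuracy.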
Most computer vision work traditionally focuses on increasing accuracy; however, runtime and memory usage are also important considerations in real-world applications such as autonomous driver assistance systems. A multi-task network is an attractive solution because a single inference pass produces N results, instead of N separate passes through single-task models. Most existing approaches that handle multiple tasks with a single integrated architecture simply combine existing components, without considering how parameters are allocated among tasks. In this paper, we present a novel approach to determining how each task in a multi-task network exploits its commonalities with, and differences from, the other tasks. We determine the architecture by assigning priorities to the tasks, and we propose a network based on Mask R-CNN that solves three tasks related to advanced driver assistance systems at once: semantic segmentation, instance segmentation, and monocular depth estimation. For model learning, we propose two loss functions with faster convergence for the two auxiliary tasks. These tasks provide geometric features in the early stage of training and avoid a negative impact on the accuracy of the main task in the late stage. The difference in convergence speed between the main loss and the two auxiliary losses spares us the expensive process of hand-tuning the relative weight of each task's loss. To determine how information is shared, we propose a lightweight attention-based module called the Entire Attention Module (EAM). EAM enriches the shared representation by enhancing global context and spatial information and by enlarging the effective receptive field. Experimental results show that, although EAM uses far fewer parameters than a 3×3 convolution layer, accuracy increases both when EAM is applied to FPN and when the tasks are trained jointly.
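The parameter comparison in the abstract can be made concrete with a little arithmetic. The sketch below is not the EAM itself, whose layout the abstract does not specify. It counts the parameters of a 3×3 convolution and, for contrast, of a generic squeeze-and-excitation style channel gate (two 1×1 convolutions around a bottleneck). The channel width of 256, the reduction ratio of 16, and the function names are assumptions chosen for illustration only.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters in a k x k 2D convolution: weights plus optional bias."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def channel_gate_params(channels, reduction=16):
    """Parameters in a squeeze-and-excitation style channel gate.

    This is a stand-in for a lightweight attention block, NOT the EAM:
    a 1x1 conv down to channels // reduction, then a 1x1 conv back up.
    """
    hidden = channels // reduction
    return conv2d_params(channels, hidden, 1) + conv2d_params(hidden, channels, 1)


channels = 256  # assumed feature width, chosen for illustration
conv3x3 = conv2d_params(channels, channels, 3)
gate = channel_gate_params(channels)

print(f"3x3 convolution: {conv3x3} parameters")  # 590080
print(f"channel gate:    {gate} parameters")     # 8464
```

Under these assumptions the gate needs under 2% of the parameters of the 3×3 convolution, which illustrates how an attention block built from pooling and 1×1 projections can stay lighter than a single spatial convolution.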
Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Motivation
1.2 Related works
1.2.1 Features Enhancement
1.2.2 Multi-task learning
Instance Segmentation
Semantic Segmentation
Monocular depth prediction
1.3 System overview and Contributions
Chapter 2. Pyramid Entire Attention Network
Chapter 3. Entire Attention Module (EAM)
3.1 Channel-wise attention
3.2 Multiple receptive field attention
3.3 Self-spatial attention
3.4 Self-spatial attention with the multi-head mechanism
3.5 EAM and FPN
Chapter 4. Experimental Results
4.1 Implementation detail
4.2 EAM comparison
4.3 Instance Segmentation
4.4 Semantic Segmentation
4.4.1 Parameters and Accuracy
4.4.2 Ablation study
4.4.3 Visualization
4.5 Variant EAMs
Chapter 5. Conclusion
References
