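An Attention-based Multitask Network for Advanced Driver Assistance System__National Dong Hwa University Master's and Doctoral Theses Full-Text Image System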

Author:	Hung-Yu Yeh (葉泓佑)
Title:	An Attention-based Multitask Network for Advanced Driver Assistance System
Title (Chinese):	基於注意力機制之多任務神經網路應用於輔助駕駛系統
Advisor:	I-Cheng Chang (張意政)
Committee members:	Yu-Fei Huang (黃于飛), Yuan-Kai Wang (王元凱)
Degree:	Master's
Institution:	National Dong Hwa University (國立東華大學)
Department:	Department of Computer Science and Information Engineering
Student ID:	610421251
Publication year (ROC calendar):	108 (2019 CE)
Graduation academic year:	107
Language:	English
Pages:	31
Keywords:	multi-task learning, attention mechanism, convolutional neural network, instance segmentation, semantic segmentation, advanced driver assistance system (ADAS), receptive field

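Abstract (translated from the Chinese):	Most computer vision work concentrates on improving accuracy, but in real-time applications such as advanced driver assistance systems, runtime and memory usage are equally important considerations. A multi-task model can run several tasks within the same unit of time, saving considerable runtime and memory compared with running single-task models one at a time. Most existing multi-task models, however, are assembled directly from existing components of their respective single-task fields, without considering parameter allocation or the purpose of the model. In this thesis we therefore propose a new way to combine several single-task models effectively into one multi-task model. We first decide which tasks are primary and which are secondary, and then design how the model learns and how it shares weights. Building on Mask R-CNN, we propose an attention-based multi-task model that simultaneously solves three important tasks in driver assistance systems: semantic segmentation, instance segmentation, and monocular depth estimation. For the question of how to learn, we propose a fast-converging loss function for each of the two secondary auxiliary tasks. This lets the two tasks supply information to the main task early in training to raise accuracy, while later in training their faster convergence keeps them from harming the convergence and accuracy of the main task. This approach avoids the high cost of manually tuning the balance among the loss functions. For the question of how to share, we propose a module called the Entire Attention Module (EAM), which enriches the semantics of the shared feature layers through a global texture (context) module, a spatial information module, and an effective receptive field enlargement module. Experimental results show that although EAM uses fewer parameters than a single 3×3 convolution layer, combining it with FPN still raises accuracy, and joint training likewise raises accuracy.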

Abstract (English):	Most computer vision research focuses on increasing accuracy; however, runtime and memory usage are also important in real-world applications such as advanced driver assistance systems (ADAS). A multi-task network is an attractive solution because a single inference pass produces N results, instead of N passes through N single-task models. Most existing approaches that integrate multiple tasks into one network simply combine existing components, without considering how parameters are allocated among tasks. In this thesis, we present a novel way to determine how a task in a multi-task network exploits its commonalities with and differences from the other tasks. We determine the architecture by assigning priorities to the tasks, and we propose a network based on Mask R-CNN that solves three ADAS-related tasks at once: semantic segmentation, instance segmentation, and monocular depth estimation. For model learning, we propose two loss functions with faster convergence for the two auxiliary tasks; these tasks provide geometric features in the early stage of training and avoid a negative impact on the accuracy of the main task in the late stage. The difference in convergence speed between the main loss and the two auxiliary losses spares us the expensive process of hand-tuning the relative weight of each task's loss. To determine how to share information, we propose a lightweight attention-based module called the Entire Attention Module (EAM). EAM enriches the shared representation by enhancing global context and spatial information and by enlarging the effective receptive field. The experimental results show that, although EAM uses far fewer parameters than a 3×3 convolution layer, accuracy increases both when EAM is added to FPN and when the tasks are trained jointly.
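
The parameter claim in the abstract can be put in perspective with a rough count. The following is a minimal sketch and not the thesis's EAM, whose exact layers are not given on this page: it assumes a squeeze-and-excitation-style channel-attention block (two fully connected layers with a reduction ratio of 16) as a stand-in for a lightweight attention module, and it assumes 256-channel feature maps, a width commonly used with FPN. All function names and constants are hypothetical.

```python
"""Rough parameter count: a 3x3 convolution vs. an SE-style channel-attention block.

Illustrative only. The attention block below is an assumed stand-in,
not the Entire Attention Module described in the thesis.
"""


def conv2d_params(c_in, c_out, kernel_size, bias=True):
    """Number of learnable parameters in a standard 2-D convolution layer."""
    weights = c_in * c_out * kernel_size * kernel_size
    return weights + (c_out if bias else 0)


def channel_attention_params(channels, reduction=16, bias=True):
    """Parameters of two fully connected layers:
    channels -> channels // reduction -> channels."""
    hidden = channels // reduction
    squeeze = channels * hidden + (hidden if bias else 0)
    excite = hidden * channels + (channels if bias else 0)
    return squeeze + excite


CHANNELS = 256  # assumed feature width, not taken from the thesis

conv_count = conv2d_params(CHANNELS, CHANNELS, kernel_size=3)
attn_count = channel_attention_params(CHANNELS)

print("3x3 convolution:", conv_count)      # 590080
print("channel attention:", attn_count)    # 8464
```

Under these assumptions the attention block needs under 2% of the parameters of the 3×3 convolution, which shows how an attention module can stay much lighter than a single convolution layer at the same feature width.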

Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Motivation
1.2 Related works
1.2.1 Features Enhancement
1.2.2 Multi-task learning
1.2.2.1 Instance Segmentation
1.2.2.2 Semantic Segmentation
1.2.2.3 Monocular depth prediction
1.3 System overview and Contributions
Chapter 2. Pyramid Entire Attention Network
Chapter 3. Entire Attention Module (EAM)
3.1 Channel-wise attention
3.2 Multiple receptive field attention
3.3 Self-spatial attention
3.4 Self-spatial attention with the multi-head mechanism
3.5 EAM and FPN
Chapter 4. Experimental Results
4.1 Implementation detail
4.2 EAM comparison
4.3 Instance Segmentation
4.4 Semantic Segmentation
4.4.1 Parameters and Accuracy
4.4.2 Ablation study
4.4.3 Visualization
4.5 Variant EAMs
Chapter 5. Conclusion
References
