
Detailed Record

Author: 陳鼎勳
Author (English): Ding-Shun Chen
Thesis Title: 基於深度學習的台灣手語辨識研究
Thesis Title (English): Research of Taiwanese Sign Language Recognition Based on Deep Learning
Advisor: 羅壽之
Advisor (English): Shou-Chih Lo
Committee Members: 李官陵、彭勝龍
Committee Members (English): Guan-Ling Lee, Sheng-Lung Peng
Degree: Master's
Institution: National Dong Hwa University (國立東華大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Student ID: 610821244
Year of Publication: 2022 (ROC year 111)
Graduation Academic Year: 110 (2021-2022)
Language: Chinese
Number of Pages: 62
Keywords: 動作辨識、深度學習、台灣手語、手語辨識、物件偵測
Keywords (English): Action Recognition, Deep Learning, Taiwan Sign Language, Sign Language Recognition, Object Detection
Statistics:
  • Recommendations: 0
  • Views: 102
  • Downloads: 60
  • Bookmarks: 0
Abstract:
There are more than 40 million hearing-impaired people in the world today, including more than 120,000 in Taiwan. Sign language plays an important role in communication between hearing people (people with normal hearing) and deaf people, as well as among deaf people themselves. In this research, we aim to build a Taiwan Sign Language recognition system that can support both sign language learning and general recognition for translation. To this end, we collect and build a Taiwan Sign Language dataset of 52 common words, each recorded in 15 different versions.
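The record does not document how the dataset is organized, but a version-based split makes the "52 words, 15 versions each" structure concrete. The sketch below is purely hypothetical: the directory layout, file naming, and choice of held-out versions are all assumptions, not the thesis's documented setup.

```python
# Hypothetical sketch only; the thesis does not publish its dataset layout.
# Assumes clips are stored as tsl_dataset/<word>/<version>.mp4 and that two
# of the 15 versions per word are held out for testing.
from pathlib import Path

ROOT = Path("tsl_dataset")  # hypothetical root directory

def split_by_version(root: Path, test_versions=(14, 15)):
    """Split clips so that unseen versions of each word form the test set."""
    train, test = [], []
    for clip in sorted(root.glob("*/*.mp4")):
        version = int(clip.stem)  # e.g. "07.mp4" -> 7
        (test if version in test_versions else train).append(clip)
    return train, test

train_clips, test_clips = split_by_version(ROOT)
print(len(train_clips), len(test_clips))  # expect 52*13 and 52*2 clips
```

Holding out whole versions, rather than random clips, tests whether a recognizer generalizes to renditions it has never seen.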
Sign language recognition and translation have attracted much attention in the field of action recognition, and their methods largely overlap with those of general action recognition; the main difference is that sign language recognition and translation place more emphasis on hand expression. Typical methods operate on video data from which RGB pixels, optical flow, or joint keypoints are extracted, combined with temporal networks such as HMMs, LSTMs, TCNs, or attention mechanisms. Building the full deep network from models drawn from several of these fields has become common practice.
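To make this pattern concrete, here is a minimal PyTorch sketch of the common CNN-plus-recurrent design: a 2D CNN extracts one feature vector per frame and a GRU models the temporal sequence, in the spirit of the thesis's Hybrid-GRU model (see the table of contents). This is not the thesis's exact architecture; the ResNet-18 backbone, hidden size, and clip length are illustrative assumptions.

```python
# Minimal sketch of a per-frame CNN feeding a GRU sequence classifier.
# Backbone, hidden size, and class count (52 words) are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class CnnGruClassifier(nn.Module):
    def __init__(self, num_classes=52, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()   # keep the 512-d pooled feature
        self.cnn = backbone
        self.gru = nn.GRU(512, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):         # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))  # (B*T, 512), one feature per frame
        feats = feats.view(b, t, -1)           # (B, T, 512)
        _, last = self.gru(feats)              # last hidden state: (1, B, hidden)
        return self.head(last[-1])             # word logits: (B, num_classes)

logits = CnnGruClassifier()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 52])
```

Swapping the GRU for an LSTM, TCN, or attention block changes only the temporal half of this design, which is why models from different fields combine so readily.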
Mainstream approaches to sign language recognition and translation take either whole images or cropped images as input. Recognition accuracy is usually higher with cropped images, but so is the execution cost: beyond the cropping itself, cropped images are usually paired with the whole image as auxiliary input, and feature extraction and fusion over these two image streams dominate the model's computation. This thesis uses the high-performance YOLO detector as its cropping method, combined with a mixed feature-extraction and sampling scheme, retaining good execution performance while maintaining high accuracy; the overall recognition rate stays above 80%.
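To illustrate the two-stream idea, the following hedged sketch crops hand regions with YOLOv5 and fuses per-frame features from the cropped and whole-image streams. The "hands.pt" checkpoint is hypothetical (the stock COCO-trained model has no hand class, so a fine-tuned detector is assumed), and plain concatenation stands in for whatever fusion scheme the thesis actually uses.

```python
# Hedged sketch: assumes a YOLOv5 checkpoint fine-tuned to detect the
# signer's hands; "hands.pt" is a hypothetical path.
import torch

# Official torch.hub entry point of the ultralytics/yolov5 repository.
detector = torch.hub.load("ultralytics/yolov5", "custom", path="hands.pt")

def crop_hands(frames):
    """Return the highest-confidence hand crop from each video frame."""
    crops = []
    for frame in frames:                  # frame: HxWx3 uint8 array
        boxes = detector(frame).xyxy[0]   # (n, 6): x1, y1, x2, y2, conf, cls
        if len(boxes):                    # detections arrive sorted by confidence
            x1, y1, x2, y2 = boxes[0, :4].int().tolist()
            crops.append(frame[y1:y2, x1:x2])
        else:
            crops.append(frame)           # no detection: fall back to whole frame
    return crops

def fuse(whole_feats, crop_feats):
    """Concatenate per-frame features from the two streams, (B, T, 512) each."""
    return torch.cat([whole_feats, crop_feats], dim=-1)  # (B, T, 1024)
```

Because the detector and both feature extractors run on every frame, this is where the execution cost concentrates, which the thesis's mixed feature-extraction and sampling scheme is designed to keep in check.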
Table of Contents:
Abstract (Chinese) ii
Abstract (English) iv
Table of Contents vi
List of Tables viii
List of Figures x
Chapter 1 Introduction 1
1-1 Research Background and Motivation 1
1-2 Research Objectives 1
1-3 Thesis Outline 2
Chapter 2 Related Work 3
2-1 Introduction to Sign Language 3
2-1-1 Sign Languages of the World 3
2-1-2 Taiwan Sign Language 4
2-2 Sign Language Recognition 6
2-2-1 Machine Learning 7
2-2-2 Deep Learning 8
2-2-3 Action Recognition 10
2-2-4 Sign Language Recognition and Translation 11
2-2-5 Object Detection Models 13
2-2-6 Temporal Models 20
Chapter 3 Taiwan Sign Language Recognition 23
3-1 Taiwan Sign Language Dataset 23
3-1-1 Dataset Creation 23
3-1-2 Dataset Splitting 26
3-2 Sign Language Recognition Pipeline 27
3-2-1 Data Preprocessing 27
3-2-2 Data Augmentation 28
3-2-3 Network Model Construction 29
3-2-4 Experimental Models 37
Chapter 4 Experimental Results and Discussion 41
4-1 Experimental Environment 41
4-2 Evaluation Methods 42
4-2-1 YOLOv5 Evaluation Method 42
4-2-2 Action Recognition Network Evaluation Method 43
4-3 Discussion of Experimental Methods 43
4-3-1 YOLOv5 Results 43
4-3-2 Comparison of Action Recognition Network Models 46
4-3-3 Hybrid-GRU Model Parameter Tests 48
4-3-4 Image Input Tests 49
4-3-5 Data Augmentation Tests 50
4-3-6 Input Length Tests 51
4-3-7 Discussion of Experiments 53
Chapter 5 Conclusion and Future Work 55
5-1 Conclusion 55
5-2 Future Work 55
References 57
References:
百靈佳殷格翰 MMH 計畫 [Boehringer Ingelheim MMH Project]. (n.d.). 【消弭不平等】手語界的維基百科-科技看見了聽障者的聲音 [A Wikipedia for the sign language world: Technology sees the voice of the hearing-impaired]. Retrieved from https://www.boehringer-ingelheim.tw/making-more-health-3
莉娜手語工作坊 [Lina Sign Language Workshop]. (2006, January 2). 超有趣手語廣角鏡 [A fun wide-angle look at sign language]. Retrieved from https://susan6262.pixnet.net/blog/post/287696432
李信賢. (2019, July 8). 國際手語(IS)是否為一種語言? [Is International Sign (IS) a language?]. Retrieved from http://taslifamily.org/?p=4826
台灣手語 [Taiwan Sign Language]. (n.d.). In 维基百科 [Wikipedia]. Retrieved August 1, 2022, from https://zh.wikipedia.org/w/index.php?title=%E5%8F%B0%E7%81%A3%E6%89%8B%E8%AA%9E&oldid=68823853
林雨佑. (2020, July 1). 真的假的?台灣手語跟日本手語嘛會通? [Really? Is Taiwan Sign Language mutually intelligible with Japanese Sign Language?]. Retrieved from https://www.twreporter.org/a/mini-reporter-sign-language-taiwan-and-japan
王振德. (2000). 國語口手語 [Signed Mandarin]. In 劉真 (Ed.), 教育大辭書 [Encyclopedic Dictionary of Education]. Taipei: 文景. Retrieved from https://terms.naer.edu.tw/detail/1309266/
許逸如. (2020, August 21). 【語言S4E10】台灣手語也有南北差異?!原來台灣手語這麼有趣 [Taiwan Sign Language has north-south variation too?! Taiwan Sign Language turns out to be fascinating]. Retrieved from https://www.mirrormedia.mg/story/20200820cul007/
蔡素娟、戴浩一、陳怡君. (2015). 台灣手語線上辭典 [Taiwan Sign Language Online Dictionary] (3rd ed., Chinese version). Graduate Institute of Linguistics, National Chung Cheng University. Retrieved from http://tsl.ccu.edu.tw/web/browser.htm
SignTube. (n.d.). 基礎台灣手語 [Basic Taiwanese Sign Language]. YouTube. Retrieved August 1, 2022, from https://www.youtube.com/playlist?list=PLzI2EvXfsJoM---uqlUP56fENljJ3JWrO
albumentations-team. (n.d.). albumentations. Retrieved August 1, 2022, from https://github.com/albumentations-team/albumentations.
He, S. (2019, October). Research of a sign language translation system based on deep learning. 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), pp. 392-396. doi: 10.1109/AIAM48774.2019.00083
HHTseng. (n.d.). Video Classification. Retrieved August 1, 2022, from https://github.com/HHTseng/video-classification.
Loye, G. (2020, February 10). Gated recurrent unit (GRU) with pytorch. Retrieved from https://blog.floydhub.com/gru-with-pytorch/
SkalskiP. (n.d.). make-sense.ai. Retrieved August 1, 2022, from https://github.com/SkalskiP/make-sense
Ultralytics. (n.d.). yolov5. Retrieved August 1, 2022, from https://github.com/ultralytics/yolov5
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Chaikaew, A., Somkuan, K., & Yuyen, T. (2021, March). Thai sign language recognition: an application of deep neural network. 2021 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunication Engineering, pp. 128-131. doi: 10.1109/ECTIDAMTNCON51128.2021.9425711
Dima, T. F., & Ahmed, M. E. (2021, July). Using YOLOv5 Algorithm to Detect and Recognize American Sign Language. 2021 International Conference on Information Technology (ICIT), pp. 603-607. doi: 10.1109/ICIT52682.2021.9491672
Farha, Y. A., & Gall, J. (2019). Ms-tcn: Multi-stage temporal convolutional network for action segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575-3584.
Gao, W., Fang, G., Zhao, D., & Chen, Y. (2004). A Chinese sign language recognition system based on SOFM/SRN/HMM. Pattern Recognition, 37(12), 2389-2402. doi: 10.1016/S0031-3203(04)00165-7
Guo, D., Wang, S., Tian, Q., & Wang, M. (2019, August). Dense Temporal Convolution Network for Sign Language Translation. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pp. 744-750. doi: 10.24963/ijcai.2019/105
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700-4708. doi: 10.1109/CVPR.2017.243
Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546-6555. doi: 10.1109/CVPR.2018.00685
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. doi: 10.1109/CVPR.2016.90
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90. doi: 10.1145/3065386
Koller, O., Zargaran, O., Ney, H., & Bowden, R. (2016). Deep sign: Hybrid CNN-HMM for continuous sign language recognition. Proceedings of the British Machine Vision Conference (BMVC), pp. 136.1-136.12. doi: 10.5244/C.30.136
Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117-2125.
Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017). Temporal convolutional networks for action segmentation and detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156-165. doi: 10.1109/CVPR.2017.113
Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083-7093. doi: 10.1109/ICCV.2019.00718
Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8759-8768. doi: 10.1109/CVPR.2018.00913
Mittal, A., Zisserman, A., & Torr, P. H. (2011, August). Hand detection using multiple proposals. Proceedings of the British Machine Vision Conference, pp.75.1-75.11. doi: 10.5244/C.25.75
Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Qiu, Z., Yao, T., & Mei, T. (2017). Learning spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, pp. 5533-5541.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788. doi: 10.1109/CVPR.2016.91
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), pp. 1137-1149. doi: 10.1109/TPAMI.2016.2577031
Redmon, J., & Farhadi, A. (2017). YOLO9000: better, faster, stronger. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263-7271. doi: 10.1109/CVPR.2017.690
Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Slimane, F. B., & Bouguessa, M. (2021, January). Context matters: Self-attention for sign language recognition. 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7884-7891. doi: 10.1109/ICPR48806.2021.9412916
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … & Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9. doi: 10.1109/CVPR.2015.7298594
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE international conference on computer vision, pp. 4489-4497. doi: 10.1109/ICCV.2015.510
Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. Proceedings of the 36th International Conference on Machine Learning, pp. 6105-6114.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450-6459. doi: 10.1109/CVPR.2018.00675
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1), 60-79. doi: 10.1007/s11263-012-0594-8
Wang, C. Y., Liao, H. Y. M., Wu, Y. H., Chen, P. Y., Hsieh, J. W., & Yeh, I. H. (2020). CSPNet: A new backbone that can enhance learning capability of CNN. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 390-391. doi: 10.1109/CVPRW50498.2020.00203
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. Proceedings of the IEEE international conference on computer vision, pp. 3551-3558. doi: 10.1109/ICCV.2013.441
Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Ferrari, V., Hebert, M., Sminchisescu, C., & Weiss, Y. (Eds.), Lecture Notes in Computer Science: Vol. 11219. Computer Vision – ECCV 2018 (pp. 305-321). Cham, Switzerland: Springer Nature. doi: 10.1007/978-3-030-01267-0_19
Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694-4702. doi: 10.1109/CVPR.2015.7299101
Zhao, K., Zhang, K., Zhai, Y., Wang, D., & Su, J. (2020, July). Real-time sign language recognition based on video stream. 2020 39th Chinese Control Conference (CCC), pp. 7469-7474. doi: 10.23919/CCC50068.2020.9188508