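Plagiarism Detection of Indonesian Documents by using Cosine Similarity (National Dong Hwa University master's and doctoral thesis full-text system)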


Author:	Desmon Kristanto Siahaan
Title:	Plagiarism Detection of Indonesian Documents by using Cosine Similarity
Advisor:	Guan-Ling Lee (李官陵)
Committee members:	Yao-Chung Chang (張耀中), Shou-Chih Lo (羅壽之)
Degree:	Master's
Institution:	National Dong Hwa University
Department:	Department of Computer Science and Information Engineering
Student ID:	610621306
Year of publication (ROC calendar):	109 (2020)
Academic year of graduation (ROC calendar):	108
Language:	English
Number of pages:	31
Keywords:	Porter Tala Algorithm, Plagiarism, Cosine Similarity

In an academic environment, the authenticity of research papers is essential. Researchers publishing a paper must confirm that it does not duplicate previously published work, and reviewers must likewise check whether a submission may have been plagiarized. Comparing document similarity is therefore an important problem. Similarity comparison for English documents has been studied extensively, but few papers address Indonesian documents. In this thesis, we study similarity comparison for Indonesian documents and propose an effective algorithm. The proposed method uses the Porter Tala algorithm, a well-known stemming method for Indonesian proposed by Fadillah Z Tala, to reduce each Indonesian word to its root. After the roots are found, we use cosine similarity to compute the similarity between documents. The experimental results show that the proposed method can effectively detect similar documents.
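The pipeline described in the abstract (stem each word, then compare documents by cosine similarity) can be illustrated with a minimal sketch. This is not the thesis implementation: `toy_stem` and `TOY_SUFFIXES` are hypothetical stand-ins that strip a single suffix, whereas the actual Porter Tala algorithm applies an ordered set of prefix and suffix rules that this record does not reproduce. Raw term frequencies are also an assumption; the thesis may weight terms differently.

```python
import math
import re
from collections import Counter

# Hypothetical, heavily simplified suffix list for illustration only.
# The real Porter Tala stemmer uses ordered prefix and suffix rules.
TOY_SUFFIXES = ("kan", "nya", "an", "i")


def toy_stem(word: str) -> str:
    """Strip at most one illustrative suffix, keeping at least 4 letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def term_vector(text: str) -> Counter:
    """Lowercase, tokenize on ASCII letters, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(token) for token in tokens)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# "makanan" (food) and "makan" (to eat) reduce to the same toy stem,
# so the two short documents end up with identical term vectors.
doc1 = "makanan itu enak"
doc2 = "makan itu enak"
score = cosine_similarity(term_vector(doc1), term_vector(doc2))
```

Here `score` is close to 1 because stemming maps both documents onto the same terms; without the stemming step the two vectors would share only two of their three terms, which is the motivation for reducing words to roots before comparison.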

Acknowledgment
Abstract In Chinese
Abstract In English
Table of Contents
List of Figures
List of Tables
List of Definition
Chapter 1. Introduction
Chapter 2. Related Work
2.1 Plagiarism
2.1.1 Level of Plagiarism
2.1.2 Techniques of Plagiarism
2.2 Text Preprocessing
2.2.1 Tokenization
2.2.2 Stopword
2.2.3 Stemming
2.2.4 Term Weighting
2.3 Measuring Similarity
Chapter 3. Proposed Algorithm
3.1 Morphological Structure
3.2 Porter Tala
Chapter 4. Experimental Result
Chapter 5. Conclusion and Future Work
References

