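Author: 陳家慶
Author (English): Chia-Ching Chen
Thesis title: 基於網站日誌資料進行常到訪網頁最大組合之大數據分析
Thesis title (English): Big Data Analysis for Largest Combination of Frequently Visited Web Pages Based on Web Log Data
Advisor: 雍忠
Advisor (English): Chung Yung
Committee members: 原友蘭, 楊武
Committee members (English): Yu-Lan Yuan, Wuu Yang
Degree: Master's
Institution: National Dong Hwa University (國立東華大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Student ID: 610521240
Year of publication (ROC calendar): 107 (2018)
Academic year of graduation: 106
Language: English
Number of pages: 78
Keywords: Apriori, Big data, Web mining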
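In this thesis, we propose a new big data analysis method for finding the largest combinations of frequently visited web pages from web log data. Our method is based on the Apriori algorithm, which we adapt for this application. Apriori is one of the most commonly used algorithms among the association rule techniques of data mining; it mines frequent itemsets by generating candidate itemsets and checking downward closure. Because our problem has similar properties, we were motivated to apply the Apriori algorithm to web logs to compute the largest combinations of frequently visited web pages.

The Internet has become an indispensable part of daily life and allows information to flow quickly, so the log data generated by web servers is a good resource for analyzing the behavior of website users.

We call the analysis of the largest combinations of frequently visited web pages LCF analysis. With LCF analysis, we can identify how visitors behave when browsing the web pages, and then improve the website content and increase user satisfaction. Because the Apriori algorithm can mine all frequent itemsets, we use it to perform LCF analysis. Our raw data comes from the website of the Taiwan Tourism Bureau, and the main experimental subject is the web log from the Lantern Festival period (2017.11.01-2018.03.11). The raw data contains 55,318,326 records, which are grouped into 307,154 visit sessions. We ran a series of four experiments with thresholds from 0.2% to 0.5%, computing in each the largest combinations of frequently visited web pages, which differ in the number of pages they contain.

Apriori is not limited to LCF analysis: by modifying it, we can also detect non-human users, a method we call the NUD algorithm. Using the NUD algorithm, we found four non-human users, which account for more than 10 million records in the raw data. We then removed the records of the non-human users from the raw data, redid the LCF analysis with thresholds from 0.2% to 0.5%, and recomputed the largest combinations of frequently visited web pages. In each experiment we obtained a different combination. We believe these results are closer to the largest combinations of web pages frequently visited by human users.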
In this thesis, we present a big data analysis method for finding the largest combinations of frequently visited web pages based on web log data. We adapt the Apriori algorithm to compute these combinations. Apriori is one of the most widely used association rule algorithms; its primary idea is to mine frequent itemsets through candidate itemset generation and downward closure checking. Because our problem shares these properties, we were motivated to apply the Apriori algorithm to compute the largest combinations of frequently visited web pages.
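
The idea of candidate generation with a downward closure check can be illustrated with a minimal Python sketch. This is textbook Apriori applied to visit sessions, not the modified algorithm of the thesis, whose details are not given in this record. The function names, the sample page names, and the reading of "largest combination" as the frequent page set with the most pages are all assumptions made for illustration.

```python
from itertools import combinations


def apriori(sessions, min_support):
    """Return {frozenset(pages): session_count} for every frequent page set.

    sessions: iterable of page collections, one per visit session (assumed input shape).
    min_support: minimum number of sessions that must contain a page set.
    """
    sessions = [frozenset(s) for s in sessions]

    # Level 1: count how many sessions contain each single page.
    counts = {}
    for session in sessions:
        for page in session:
            key = frozenset([page])
            counts[key] = counts.get(key, 0) + 1
    current = {k: n for k, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidate generation: join two frequent (k-1)-sets into a k-set.
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            # Downward closure: every (k-1)-subset must itself be frequent.
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count the sessions that contain each surviving candidate.
        counts = {c: sum(1 for s in sessions if c <= s) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def largest_combinations(frequent):
    """Return the frequent page sets with the most pages, as sorted lists."""
    if not frequent:
        return []
    size = max(len(pages) for pages in frequent)
    return sorted(sorted(pages) for pages in frequent if len(pages) == size)


# Hypothetical sessions; the page names are invented for this example.
example_sessions = [
    ["home", "events", "lantern"],
    ["home", "events", "lantern", "hotels"],
    ["home", "events"],
    ["home", "lantern", "events"],
    ["hotels"],
]
result = largest_combinations(apriori(example_sessions, min_support=3))
```

With a support threshold of 3 sessions, `result` contains the single three-page combination of `events`, `home`, and `lantern`. The pruning step is what keeps the search tractable: a candidate is counted against the sessions only if all of its smaller subsets were already found frequent.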

Because the Internet, now indispensable in daily life, allows information to flow quickly, the web log data generated by web servers is a good resource for analyzing the behavior of web users. The analysis of the largest combinations of frequently visited web pages, which we call LCF analysis, is of interest to many web site administrators. LCF analysis gives a better picture of how users behave when they browse web pages, and administrators can use that knowledge to improve page content and increase user satisfaction. In the experiments, we use web log data from the official web site of the Taiwan Tourism Bureau covering the Lantern Festival period (2017.11.01-2018.03.11). The data contains 55,318,326 records in total, which are clustered into 307,154 visit sessions. We conduct a series of experiments with thresholds between 0.2% and 0.5% of the number of visit sessions (614 to 1535 sessions) and obtain the largest combinations of frequently visited web pages.
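
The percentage thresholds can be related to the absolute thresholds named in the chapter 4 experiment titles (1535, 1228, 921, 614) by simple arithmetic. The sketch below assumes that each threshold is the stated percentage of the 307,154 visit sessions, rounded down; that reading is an inference from the numbers, not something the record states, and the function name is hypothetical.

```python
TOTAL_SESSIONS = 307_154  # visit sessions reported in the abstract


def session_threshold(total_sessions, per_mille):
    """Floor of (per_mille / 1000) * total_sessions, using integer arithmetic.

    Rounding down is an assumption; it happens to reproduce the thresholds
    listed in the experiment titles.
    """
    return total_sessions * per_mille // 1000


# 5, 4, 3, 2 per mille correspond to 0.5%, 0.4%, 0.3%, 0.2%.
thresholds = {
    f"{pm / 10:.1f}%": session_threshold(TOTAL_SESSIONS, pm) for pm in (5, 4, 3, 2)
}
```

Under this assumption the four percentages map to 1535, 1228, 921, and 614 sessions. The slightly lower thresholds in chapter 6 (1533, 1226, 919, 613) would then correspond to a smaller session count after the non-human users are removed.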

As a by-product of developing LCF analysis, we find that with further modification of the Apriori algorithm we can detect non-human users in the visit sessions; we call this method NUD. NUD analysis detects 4 non-human users, which account for more than 10 million records. After excluding the records of these 4 non-human users, we redo the LCF experiments and obtain different combinations of frequently visited web pages. We believe these combinations better reflect the web pages frequently visited by human users.
1 Introduction
2 Background
2.1 Big Data
2.2 Web Mining
2.3 Phases in Big Data Analysis
2.4 Association Rules
3 LCF: A New Method of Largest Combination of Frequently Visited Web Pages
3.1 Five phases of Big Data Analysis on Web Log Data
3.2 Analysis Algorithm of LCF
3.3 An Example
4 Experiments on LCF
4.1 Overall Structure of a LCF System
4.2 Experiment 1: using a threshold of 1535
4.3 Experiment 2: using a threshold of 1228
4.4 Experiment 3: using a threshold of 921
4.5 Experiment 4: using a threshold of 614
4.6 Comparison
5 NUD: Non-human User Detection as a Variation of LCF
5.1 Five phases of Big Data Analysis on Web Log Data
5.2 Analysis Algorithm of NUD
5.3 Analysis Process
6 Experiment on NUD
6.1 Overall Structure of a NUD System
6.2 Experiment Result
6.3 Experiment 1: using a threshold of 1533
6.4 Experiment 2: using a threshold of 1226
6.5 Experiment 3: using a threshold of 919
6.6 Experiment 4: using a threshold of 613
7 Conclusion
(The full text is not authorized for open access.)