Author (English): Chia-Ching Chen
Thesis Title (English): Big Data Analysis for Largest Combination of Frequently Visited Web Pages Based on Web Log Data
Advisor (English): Chung Yung
Committee Members (English): Yu-Lan Yuan, Wuu Yang
Keywords (English): Apriori; Big data; Web mining
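In this thesis, we propose a new big data analysis method for the largest combinations of frequently visited web pages based on web log data. Our method builds on the Apriori algorithm, applied with modifications. The Apriori algorithm is one of the commonly used algorithms among the association rule techniques of data mining; its main approach is to mine frequent itemsets through candidate itemset generation and downward closure detection. Because these characteristics are similar, we were motivated to apply the Apriori algorithm to web logs to compute the largest combinations of frequently visited web pages.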



In this thesis, we present a big data analysis method for finding the largest combinations of frequently visited web pages based on web log data. We compute these combinations with a modified version of the Apriori algorithm. Apriori is one of the most widely used association rule algorithms; its central idea is to mine frequent itemsets through candidate itemset generation and downward closure detection. Because combinations of frequently visited web pages share these attributes with frequent itemsets, we were motivated to apply the Apriori algorithm to compute them.
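
The two steps named above, candidate itemset generation and downward closure detection, can be sketched on toy data as follows. This is a minimal textbook Apriori, not the modified algorithm of the thesis, whose details are not given in this abstract. The page names, the function names `apriori` and `largest_combinations`, and the reading of "largest" as "maximum number of pages" are all assumptions made for illustration.

```python
from itertools import combinations

def apriori(sessions, min_support):
    """Return {frozenset(pages): support_count} for all frequent page sets."""
    sessions = [frozenset(s) for s in sessions]
    # Level 1: count how many sessions contain each individual page.
    counts = {}
    for s in sessions:
        for page in s:
            key = frozenset([page])
            counts[key] = counts.get(key, 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Candidate generation: join frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Downward closure: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # Support counting: number of sessions containing the candidate.
        counts = {c: sum(1 for s in sessions if c <= s) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def largest_combinations(frequent):
    """Keep only the frequent page sets with the maximum number of pages."""
    if not frequent:
        return []
    size = max(len(c) for c in frequent)
    return [set(c) for c in frequent if len(c) == size]

# Toy visit sessions with made-up page names (not from the thesis data).
sessions = [
    {"/home", "/events", "/lantern"},
    {"/home", "/events", "/lantern", "/hotels"},
    {"/home", "/events"},
    {"/home", "/lantern", "/events"},
    {"/hotels"},
]
result = largest_combinations(apriori(sessions, min_support=3))
print(result)
```

With a support threshold of 3 sessions, the only frequent combination of size three in this toy data is the set of `/home`, `/events`, and `/lantern`; `/hotels` is pruned at the first level because it appears in only two sessions.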

The Internet is now indispensable in daily life and allows information to flow quickly, so the web log data generated by web servers are a good resource for analyzing the behavior of web users. The analysis of the largest combinations of frequently visited web pages, which we call LCF analysis, is of interest to and requested by many web site administrators. LCF analysis gives a better understanding of how users behave when they browse web pages, and web site administrators may use this knowledge to improve page content and enhance user satisfaction. In our experiments, we use web log data from the official web site of the Taiwan Tourism Bureau covering a period that spans the Lantern Festival (2017.11.01-2018.03.11). In total, the web log data contain 55,318,326 records, which are clustered into 307,154 visit sessions. We conduct a series of experiments with thresholds between 0.2% and 0.5% of the number of visit sessions and obtain the largest combinations of frequently visited web pages.
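
The thresholds named in the LCF experiment titles (1535, 1228, 921, and 614) coincide with 0.5%, 0.4%, 0.3%, and 0.2% of the 307,154 visit sessions when the result is rounded down. The short check below illustrates that arithmetic; the rounding rule and the helper name `threshold` are assumptions, not something the abstract states.

```python
TOTAL_SESSIONS = 307_154  # visit sessions reported in the abstract

def threshold(sessions, tenths_of_percent):
    # Integer arithmetic avoids floating-point error; // rounds down (assumed rule).
    return sessions * tenths_of_percent // 1000

# 0.5%, 0.4%, 0.3%, 0.2% expressed in tenths of a percent.
thresholds = [threshold(TOTAL_SESSIONS, t) for t in (5, 4, 3, 2)]
print(thresholds)  # [1535, 1228, 921, 614]
```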

As a by-product of developing LCF analysis, we find that a further modification of the Apriori algorithm lets us detect non-human users in the visit sessions; we call this NUD. The NUD analysis detects 4 non-human users, which account for more than 10 million records. After excluding the records of these 4 non-human users, we repeat the LCF experiments and obtain different combinations of frequently visited web pages. We believe that these combinations are the web pages frequently visited by human users.
1 Introduction
2 Background
2.1 Big Data
2.2 Web Mining
2.3 Phases in Big Data Analysis
2.4 Association Rules
3 LCF: A New Method of Largest Combination of Frequently Visited Web Pages
3.1 Five phases of Big Data Analysis on Web Log Data
3.2 Analysis Algorithm of LCF
3.3 An Example
4 Experiments on LCF
4.1 Overall Structure of a LCF System
4.2 Experiment 1: using a threshold of 1535
4.3 Experiment 2: using a threshold of 1228
4.4 Experiment 3: using a threshold of 921
4.5 Experiment 4: using a threshold of 614
4.6 Comparison
5 NUD: Non-human User Detection as a Variation of LCF
5.1 Five phases of Big Data Analysis on Web Log Data
5.2 Analysis Algorithm of NUD
5.3 Analysis Process
6 Experiment on NUD
6.1 Overall Structure of a NUD System
6.2 Experiment Result
6.3 Experiment 1: using a threshold of 1533
6.4 Experiment 2: using a threshold of 1226
6.5 Experiment 3: using a threshold of 919
6.6 Experiment 4: using a threshold of 613
7 Conclusion
