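Big Data Analytics on navigational preference of users in different languages | National Dong Hwa University Master's and Doctoral Theses Full-Text Image System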

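Author:	Andre Freeman
Title (Chinese):	不同語言的網站使用者導航偏好之大數據分析
Title (English):	Big Data Analytics on navigational preference of users in different languages
Advisor:	Chung Yung (雍忠)
Oral examination committee:	Tyng-Ruey Chuang (莊庭瑞), Min-Xiou Chen (陳旻秀)
Degree:	Master's
Institution:	National Dong Hwa University
Department:	Department of Computer Science and Information Engineering
Student ID:	610921313
Year of publication (ROC calendar):	111 (2022)
Academic year of graduation:	110
Language:	English
Number of pages:	71
Keywords (translated from Chinese):	big data analytics, web user behavior, website browsing preference, application of knowledge discovery techniques
Keywords (English):	Big data analytics, web user behavior, navigational preference, weblogs, knowledge discovery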

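The purpose of this thesis is to apply big data analytics techniques to weblog big data in order to characterize the usage preferences of website users who browse in different languages. The data used in this research comprise all weblog data from 2020 for the website of the Department of Computer Science and Information Engineering (CSIE) at National Dong Hwa University (NDHU). The dataset contains 5,066,905 click entries covering 9,590 distinct URLs.
In this thesis, the web page topic space is defined as the resources of all URLs in the subdirectories below the website's home page, namely: (1) future students, (2) research, (3) newlist, (4) aboutus, (5) course, (6) activity, (7) alumni, and (8) resource. To discover users' browsing preferences, the researcher divides user activity into four groups: (1) English-only activity, (2) Chinese-only activity, (3) mixed Chinese and English activity, and (4) other activity. The thesis applies knowledge discovery techniques in four phases, namely the Time Selection Process (TSP), the Session Aggregation and Transformation Process (SATP), the Analysis Application Process (AAP), and Analysis Results Evaluation and Application (AREA), and uses these analyses to propose Business Intelligence (BI) strategic suggestions.
To discover users' website browsing preferences, this research employs several analysis algorithms: 1) the KMeans algorithm with the Elbow method, used in the TSP to determine the best time length for splitting sessions; 2) the Apriori algorithm, used to find session clusters of users' browsing preferences; 3) the Variable Length Markov Chain (VLMC) algorithm, applied in the AAP to find frequent sequences in users' browsing preferences; and 4) the Latent Dirichlet Allocation (LDA) algorithm, applied to topic-space classification to rank web pages by popularity.
In the conclusion of this thesis, three strategic suggestions are proposed from a data science perspective. The researcher firmly believes that these suggestions will have a positive impact and will support decision making in the business context of the NDHU CSIE department. In addition, the thesis includes supplementary information outlining the ten most popular web pages of 2020, providing a BI application summarized from website usage.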

The aim of this thesis is to apply data analytics techniques to weblog big data in order to discover the navigational preferences of users who browse the department website in different languages. The data comprise the entire 2020 weblog collection from the website of the Department of Computer Science and Information Engineering (CSIE) at National Dong Hwa University (NDHU). The dataset contains a total of 5,066,905 click entries covering 9,590 unique URLs.
In this thesis, a topic space is defined over all URLs in the following subdirectories: (1) future students, (2) research, (3) newlist, (4) aboutus, (5) course, (6) activity, (7) alumni, and (8) resource. To discover users' navigational preferences, the researcher divides user activity into three groups: (1) English-only activity, (2) Chinese-only activity, and (3) all activity. The knowledge discovery process is applied to these groups in four phases, namely the Timedelta Selection Process (TSP), the Session Aggregation and Transformation Process (SATP), the Analysis Application Process (AAP), and Analysis Results Evaluation and Application (AREA), in order to make Business Intelligence (BI) strategic suggestions.
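As a rough illustration of the topic space described above, the following Python sketch maps a URL to one of the eight topics by its first path segment. It is not taken from the thesis: the function name `topic_of`, the host `example.edu`, and the assumption that directory names match the topic labels (up to underscores or hyphens) are all hypothetical.

```python
from typing import Optional
from urllib.parse import unquote, urlparse

# Topic labels as listed in the abstract. Assumption: the real directory
# spellings match these labels up to "_" or "-" separators.
TOPICS = ("future students", "research", "newlist", "aboutus",
          "course", "activity", "alumni", "resource")


def topic_of(url: str) -> Optional[str]:
    """Return the topic named by the first path segment, or None if no match."""
    path = unquote(urlparse(url).path)
    segments = [s for s in path.split("/") if s]
    if not segments:
        return None  # the home page itself belongs to no subdirectory
    first = segments[0].lower().replace("_", " ").replace("-", " ")
    return first if first in TOPICS else None


# Hypothetical usage on made-up URLs.
for u in ("https://example.edu/research/labs.html", "https://example.edu/"):
    print(u, "->", topic_of(u))
```

URLs that fall outside the eight subdirectories return `None` here; how the thesis treats such URLs is not stated in the abstract.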
Several algorithms are applied to discover users' navigational preferences: 1) the KMeans algorithm with the Elbow method is used in the TSP to determine the best timedelta for splitting sessions; 2) Apriori is used to find frequent itemsets in users' navigational preferences and 3) Variable Length Markov Chain (VLMC) is used to find frequent sequences in users' navigational preferences, both in the AAP; and 4) Latent Dirichlet Allocation (LDA) is applied to the topic space to rank the top N pages.
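To make two of these steps concrete, the Python sketch below splits one user's clicks into sessions with a fixed timedelta and then runs a small level-wise Apriori over the resulting sessions. It is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the sample clicks, the 1800-second timedelta (the thesis selects its value with KMeans and the Elbow method), and the minimum support of 2 are all hypothetical.

```python
from itertools import combinations


def sessionize(clicks, timedelta):
    """Split (timestamp_seconds, page) clicks of one user into sessions.

    A new session starts when the gap to the previous click exceeds timedelta.
    """
    sessions, current, last_ts = [], [], None
    for ts, page in sorted(clicks):
        if last_ts is not None and ts - last_ts > timedelta:
            sessions.append(current)
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def apriori(sessions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support sessions."""
    baskets = [frozenset(s) for s in sessions]
    # Level 1: count single pages.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: c for k, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: unions of frequent (k-1)-itemsets that contain k items.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical clicks: (seconds since start, topic of the visited page).
clicks = [(0, "course"), (60, "research"), (5000, "course"),
          (5030, "research"), (5100, "alumni"), (20000, "alumni")]
sessions = sessionize(clicks, timedelta=1800)
print(sessions)
print(apriori(sessions, min_support=2))
```

Apriori treats each session as an unordered set of pages, so it captures which pages co-occur but not their order; ordered patterns are the role the abstract assigns to VLMC.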
In the conclusion of this thesis, three strategic suggestions are made from a data science standpoint, which the researcher strongly believes would have the greatest impact among the findings of this study. It is hoped that these BI strategic suggestions will aid decision making in the business context of the NDHU CSIE department. Basic website summary information for 2020 is also included to round out the BI application based on website usage.

Acknowledgement i
Abstract iii
Abstract (Chinese) v
Table of Contents vii
List of Figures ix
List of Tables xi
1. Introduction 1
1.1 Motivation and Goals 3
1.2 Organization of Thesis 3
2. Literature Review 5
2.1 Big Data 5
2.2 Knowledge Discovery Process 7
2.2.1 Stages of Data Analytics 7
2.2.2 Data Mining Techniques 8
2.3 Data Mining with Weblogs 12
2.4 Other Website Analysis tools 13
2.4.1 Latent Dirichlet Allocation LDA 13
3. Material and Overall Framework 15
3.1 Dataset 15
3.2 Overall Framework 16
3.3 Preservation of data, algorithms and results 17
4. Timedelta Selection Process (TSP) 18
4.1 Timedelta Generation 18
4.2 Timedelta Selection Analysis 19
5. Session Aggregation and Transformation Process (SATP) 22
5.1 Specialized data filtering, cleaning and segregation for weblog data 22
5.1.1 URL Based Filtering 22
5.1.2 URL based cleaning 23
5.1.3 URL based segregation 24
5.2 Session Aggregation 26
5.3 SATP Experimental Efforts 30
5.4 Discontinued efforts in SATP 31
5.4.1 Modes of the SATP 31
5.4.2 Multi-Client and Server Application of SATP 32
6. Analysis Application Process (AAP) 36
6.1 Apriori 36
6.2 VLMC 37
6.3 LDA application 39
6.4 Session First and Last page analysis via probability 40
7. Analysis Result Evaluation and Application 42
7.1 Lookup Transformation for Evaluation 42
7.2 Analysis Result Application 43
7.2.1 Strategic Suggestion N(1) 43
7.2.2 Strategic Suggestion N(2) 44
7.2.3 Strategic Suggestion N(3) 45
7.3 Analysis Result Evaluation 46
7.3.1 Strategic Suggestion 1 46
7.3.2 Strategic Suggestion 2 48
7.3.3 Strategic Suggestion 3 50
7.4 Other Results 50
Top 10 user preferred pages 51
8. Conclusion 54
Bibliography 55
Appendix 57


(Full text open to external access after 2025-02-10)
01.pdf
