基於Spark架構上之循序性樣式探勘__國立東華大學博碩士論文全文影像系統

帳號：guest(18.188.229.228) 離開系統

字體大小：

詳目顯示

第 1 筆 / 共 1 筆

/1頁

論文基本資料
摘要
外文摘要
論文目次
參考文獻
電子全文

作者:	許國宏
作者(英文):	Kuo-Hung Hsu
論文名稱:	基於Spark架構上之循序性樣式探勘
論文名稱(英文):	Incremental Sequential Pattern Mining on Spark
指導教授:	吳秀陽
指導教授(英文):	Shiow-Yang Wu
口試委員:	張耀中孫宗瀛吳秀陽
口試委員(英文):	Yao-Chung Chang Tsung-Ying Sun Shiow-Yang Wu
學位類別:	碩士
校院名稱:	國立東華大學
系所名稱:	資訊工程學系
學號:	610121032
出版年(民國):	106
畢業學年度:	106
語文別:	中文
論文頁數:	54
關鍵詞:	循序性樣式探勘、漸進式循序性樣式探勘
關鍵詞(英文):	Sequential pattern mining、MapReduce、Hadoop、Spark、Incremental sequential pattern mining
相關次數:	推薦:0 點閱:18 評分: 下載:20 收藏:0

循序樣式探勘在資料探勘的領域已發展多年，目的在於挖掘原始資料之間的先後順序中是否具有循序性。近年來資料量的增長速度越來越快，而分散式運算可以花費較少的成本完成龐大工作量的特性，使得其重要性與日俱增。MapReduce的提出進一步的簡化了分散式運算的工作，Hadoop叢集便是運用MapReduce架構對於大規模資料集進行運算，亦是現在運用最為廣泛的架構。本實驗室以前便曾經以大規模資料集作為運算對象，並以Hadoop作為運算架構進行靜態的循序樣式探勘研究。現今資料庫的更新日益頻繁，靜態的循序樣式探勘已無法滿足現況，動態的循序樣式探勘研究需求迫切。有鑑於此，本文希望發展對於大規模資料集進行動態的漸進式循序樣式探勘的方法。此外Hadoop執行分散式運算時仰賴硬碟I/O的架構，速度已逐漸無法應付頻繁更新的資料庫，因此許多相關改進方法被提出，Spark便是其中之一。Spark是與MapReduce相似的分散式運算架構，與Hadoop不同的是，Spark使用記憶體內計算的方式，省去了大量的I/O工作，官方宣稱速度至少能夠達到Hadoop的10倍。本文的另一個目的是利用Spark運算架構特性，實作漸進式循序樣式探勘演算法，並於Hadoop叢集上實作驗證其效能。
本文所提出的漸進式循序樣式探勘演算法，會依照時間先後順序將前一次與本次探勘區間的資料分為三個區段：過時區段、現存區段和新增區段。過時區段和現存區段已經在上一次的探勘時計算過，差別在於該區段的資料在本次探勘是否依舊處於有效時間，新增區段則是未曾進行過計算的新進資料。每一次的探勘結果由上一次的探勘結果刪除過時區段樣式並加入新增區段樣式得到，這種做法在計算探勘結果時節省了現存區段的運算工作，當現存區段佔據的比例越重，表示可以節省的工作量越多，效能越佳。
為了驗證漸進式循序樣式探勘演算法在Spark系統上的運作效能，本文亦實際架構一組Spark叢集作為實驗，並使用IBM所提供的資料產生器（IBM Quest Synthetic Data Generator）作為實驗用的人工大規模資料集。實驗結果顯示，漸進式循序樣式探勘會比以往每次探勘皆重新計算所有有效區段資料的做法在多個面向都具有更好的效能。

Sequential pattern mining has been studied in data mining research for years. The purpose is to discover sequential patterns from large datasets. Since the datasets increase faster and faster in recent years, distributed computing architecture for handling large datasets is becoming more and more important. MapReduce is a computing structure that greatly simplify distributed computing tasks. Hadoop is a widely used distributed computing architecture based on the MapReduce framework. Our laboratory used to develop algorithms for static sequential pattern mining with Hadoop MapReduce. Since database update is much more frequent day by day, static sequential pattern is unable to meet the challenge. Dynamic sequential pattern mining method is in urgent demand. Therefore we want to develop an incremental sequential pattern mining algorithm to analyze large dynamic datasets. On the other hand, Hadoop distributed computing depends heavily on disk I/O. Programs running on top of it can’t handle frequent database update gracefully. Therefore many research was conducted targeting such a deficiency. Spark is one of them. It is similar to MapReduce but employs in-memory computing to reduce disk I/O. It is reported that Spark runs at least 10 times faster than Hadoop MapReduce. Therefore we implement our algorithm on Spark.
We develop an incremental sequential pattern mining algorithm. The data for the last and current mining periods are split into 3 segments: obsolete data, counted data and new data. The difference between them is whether the data is still valid in current mining period. Obsolete and counted segments have already been mined in the last period while the new segment is yet to be processed. Mining result is derived from last mining result by removing pattern counts in the obsolete segment and adding counts in the new segment. The idea is to save the pattern computing workload of the obsolete and counted segments. The larger the counted proportion, the more saving can be achieved which leads to better performance.
To verify the performance of our incremental sequential pattern mining algorithm on Spark, we implement our algorithm on a Spark cluster and conduct extensive experiments. We use IBM Quest Synthetic Data Generator to generate large datasets for our experiments. Experiment results show that our algorithm achieves better performance than traditional method in most aspects.

致謝...........................................................i
摘要...........................................................iii
Abstract.......................................................v
圖目錄.........................................................vii
表目錄.........................................................ix
第一章序論...................................................1
1.1研究動機與目的...............................................1
1.2研究方法與成果...............................................2
1.3論文架構.....................................................3
第二章相關技術與研究..........................................5
2.1 Hadoop.....................................................5
2.2 Spark......................................................7
2.3 Scala......................................................11
2.4循序樣式探勘（Sequential pattern mining）.....................12
2.5漸進式循序樣式探勘（Incremental sequential pattern mining）...14
第三章問題定義與演算法.......................................19
3.1問題定義.....................................................19
3.2漸進式循序樣式探勘策略........................................20
3.3 Spark實作策略...............................................23
3.4 One-Phase演算法.............................................26
3.5直接計算演算法（Direct-Computing algorithm）..................27
3.6漸進式演算法（Incremental algorithm）.........................29
3.7運用Spark提升探勘效能.........................................31
第四章實驗與效能評估.........................................33
4.1實驗環境與測試資料............................................33
4.2實驗結果.....................................................35
4.2.1交易量可擴充性實驗（transactions experiment）...............35
4.2.2探勘區段大小實驗（window size experiment）..................38
4.2.3漸進式位移量實驗（slide experiment）........................41
4.2.4平均交易長度實驗（transaction length experiment）...........45
4.2.5實驗總結...................................................50
第五章結論與未來展望.........................................51
5.1結論.........................................................51
5.2未來展望.....................................................51
參考文獻 .....................................................52

[1]R. Agrawal, R. Srikant, “Mining Sequential Patterns”, Data Engineering, 11th International Conference on, pp. 3-14, 1995.
[2]R. Agrawal, R. Srikant., “Fast algorithms for mining association rules”, Very Large Data Base, 20th International Conference on, Vol. 1215, pp. 487-499, 1994.
[3]Ming Syan Chen, Jiawei Han, Philip S. Yu, “Data Mining：an overview from a database perspective”, IEEE Transactions on Knowledge and Data Engineering, Vol. 8(6), pp. 866-883, 1996.
[4]Hong Cheng, Xifeng Yan, Jiawei Han, “IncSpan：Incremental Mining of Sequential Patterns in Large Database”, Knowledge Discovery and Data Mining, 10th ACM SIGKDD International Conference on, pp. 527-532, 2004.
[5]Ming-Yen Lin, Suh-Yin Lee, “Incremental update on sequential patterns in large databases by implicit merging and efficient counting”, Information Systems, Vol. 29(5), pp. 385-404, 2004.
[6]Florent Masseglia, Pascal Poncelet, Maguelonne Teisseire, “Incremental mining of sequential patterns in large databases”, Data and Knowledge Engineering, Vol. 46(1), pp. 97-121, 2003.
[7]Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Communications of the ACM, Vol. 51(1), pp. 107-113, January 2008.
[8]Xin Yue Yang, Zhen Liu, Yan Fu, “MapReduce as a programming model for association rules algorithm on Hadoop”, Information Sciences and Interaction Sciences, 3rd International Conference on, pp. 99-102, 2010.
[9]Ling juan Li, Min Zhang, “The strategy of Mining Association Rule Based on Cloud Computing”, Business Computing and Global Information, 2011 IEEE International Conference on, pp. 475-478, 2011.
[10]Chun-Chieh Chen, Chi-Yao Tseng, Ming-Syan Chen, “Highly Scalable Sequential Pattern Mining Based on MapReduce on the Cloud”, Big Data Congress, 2013 IEEE International Congress on, 2013.
[11]Zahra Farzanyar, Nick Cercone, “Efficient mining of frequent itemsets in social network data based on MapReduce framework”, Advances in Social Networks Analysis and Mining, 2013 IEEE/ACM International Conference on, ACM, 2013.
[12]The Apache Software Foundation, Apache Hadoop, https://hadoop.apache.org, October 2017.
[13]Yen-hui Liang, Shiow-yang Wu, “Sequence-Growth：A Scalable and Effective Frequent Itemset Mining Algorithm for Big Data Based on MapReduce Framework”, Big Data, 2015 IEEE International Congress on, pp. 393-400, 2015.
[14]The Apache Software Foundation, Apache Spark, https://spark.apache.org, October 2017.
[15]Scala, https://www.scala-lang.org/, October 2017.
[16]Bjarne Stroustrup, AT&T Bell Laboratories, Murray Hill, New Jersey, Why C++ is not just an object-oriented programming language, Vol. 6(4), ACM, 1995.
[17]Helen Ponto, Jiawei Han, Jian Pei, Ke Wang, “Multi-dimensional Sequential Pattern Mining”, Information and Knowledge Management, 10th International Conference on, ACM, pp. 81-88, 2001.
[18]David C. Anastasiu, Jeremy Iverson, Shaden Smith, George Karypis, “Big Data Frequent Pattern Mining”, Frequent Pattern Mining, Springer International Publishing, Aggarwal C., Han J. (eds), pp. 225-259, 2014.
[19]Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou, Ming-Syan Chen, “On Progressive Sequential Pattern Mining”, Information and Knowledge Management, 15th International Conference on, ACM, pp. 850-851, 2006.
[20]Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou, Ming-Syan Chen, “A General Model for Sequential Pattern Mining with a Progressive Database”, IEEE Transactions on Knowledge and Data Engineering, Vol. 20(9), pp. 1153-1167, 2008.

01.pdf

推文
推薦
評分
引用網址
轉寄

top

詳目顯示

相關論文