Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu, Santhosh Kumar Saminathan
Introduction • Many projects use HBase to store large amounts of data for distributed computation • Processing this data is a challenge for programmers • Frequent terms help us in many ways in the field of machine learning • E.g.: frequently purchased items, frequently asked questions, etc.
Problem • These HBase projects build indexes over the data • Using these indexes, the frequency of a single word is easy to find • It is hard to find the frequency of a combination of words • For example: “cloud computing” • Searching these words separately may return results like “scientific computing” or “cloud platform”
Objective • This project focuses on finding the frequency of combinations of words • We use data mining concepts and the Apriori algorithm • We implement the project with MapReduce and HBase.
Survey Topics • Apriori Algorithm • HBase • MapReduce
Data Mining What is Data Mining? • The process of analyzing data from different perspectives • Summarizing data into useful information.
Data Mining How does Data Mining work? • Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries What technology or infrastructure is needed? Two critical technological drivers answer this question: • Size of the database • Query complexity
Apriori Algorithm • The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. • Association rules are a widely applied data mining approach. • Association rules are derived from frequent itemsets. • It uses a level-wise search based on the frequent itemset property.
Apriori Algorithm & Problem Description If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support. If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are: Shoes → Jacket (Support = 50%, Confidence = 66%) Jacket → Shoes (Support = 50%, Confidence = 100%)
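The support and confidence figures above can be reproduced with a few lines of code. The transaction database below is an assumed toy example chosen to match the slide's numbers (the actual transactions are not given on the slide):

```python
# Hypothetical transaction database: chosen so that {Shoes, Jacket}
# reproduces the support/confidence values on the slide.
transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Pants"},
    {"Shirt", "Sweatshirt"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Shoes", "Jacket"}))       # 0.5   -> 50% support
print(confidence({"Shoes"}, {"Jacket"}))  # 0.666... -> 66% confidence
print(confidence({"Jacket"}, {"Shoes"}))  # 1.0   -> 100% confidence
```

Note that confidence is asymmetric: Jacket → Shoes is stronger than Shoes → Jacket because Jacket occurs in fewer transactions than Shoes.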
Apriori Algorithm Example (min support = 50%): scan database D to count candidate 1-itemsets C1 and keep the frequent itemsets L1; generate candidates C2 from L1 and scan D to get L2; generate C3 from L2 and scan D to get L3.
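The level-wise search in the example above can be sketched as follows. This is a minimal, unoptimized sketch (it omits the subset-pruning step of full Apriori), and the example transactions are the same assumed toy data as before:

```python
def apriori(transactions, min_support):
    """Level-wise search: L1 -> C2 -> L2 -> C3 -> L3 ... until no
    candidate meets min_support. Returns all frequent itemsets."""
    n = len(transactions)
    # L1: frequent single items (one scan of the database)
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation C_k: join L_{k-1} with itself
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Scan D: keep candidates that meet min_support (this is L_k)
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= t) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

example = [{"Shoes", "Shirt", "Jacket"},
           {"Shoes", "Jacket"},
           {"Shoes", "Pants"},
           {"Shirt", "Sweatshirt"}]
# With min support 50%: L1 = {Shoes}, {Shirt}, {Jacket};
# L2 = {Shoes, Jacket}; L3 is empty, so the search stops.
print(apriori(example, 0.5))
```

The repeated "scan D" inside the loop is exactly the disadvantage noted on the next slide: each level requires another full pass over the transaction database.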
Apriori Advantages & Disadvantages • ADVANTAGES: uses the large itemset property; easily parallelized; easy to implement • DISADVANTAGES: assumes the transaction database is memory-resident; requires many database scans
HBase What is HBase? • A Hadoop database • Non-relational • An open-source, distributed, versioned, column-oriented store • Modeled after Google Bigtable • Runs on top of HDFS (Hadoop Distributed File System)
MapReduce • A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster) • Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured)
Mapper and Reducer • Mappers • FrequentItemsMap – finds the combinations and assigns a key for each combination • CandidateGenMap • AssociationRuleMap • Reducers • FrequentItemsReduce • CandidateGenReduce • AssociationRuleReduce
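To illustrate how a mapper such as FrequentItemsMap might emit a key per word combination and how the matching reducer sums the counts, here is a single-process simulation. The function names mirror the slide, but the internal logic and sample records are assumptions; the real project runs these stages on Hadoop over HBase:

```python
from collections import defaultdict
from itertools import combinations

def frequent_items_map(doc_id, text):
    """Mapper sketch: emit (sorted word pair, 1) for each word
    combination in one record, so identical pairs share a key."""
    words = sorted(set(text.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1

def frequent_items_reduce(pair, counts):
    """Reducer sketch: sum the counts for one combination key."""
    return pair, sum(counts)

# Hypothetical input records and a simulation of the shuffle phase,
# which groups mapper output by key before the reducers run.
records = [(1, "cloud computing on hbase"),
           (2, "cloud computing with map reduce")]
shuffled = defaultdict(list)
for doc_id, text in records:
    for key, value in frequent_items_map(doc_id, text):
        shuffled[key].append(value)
results = dict(frequent_items_reduce(k, v) for k, v in shuffled.items())

print(results[("cloud", "computing")])  # 2 -> the pair occurs in both records
```

Sorting the words before pairing is what makes "cloud computing" and "computing cloud" map to the same key, which is the core trick for counting combinations rather than single words.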
Flow Chart Start → Find Frequent Items → Find Candidate Itemsets → Find Frequent Items → Set Null? If No, loop back to candidate generation; if Yes, Generate Association Rules.
Schedule • 1 week – talking to the experts at FutureGrid • 1 week – survey of HBase and the Apriori algorithm • 4 weeks – implementing the Apriori algorithm • 2 weeks – testing the code and gathering results.
Conclusion • Execution takes more time on a single node • As the number of mappers increases, performance improves • When the data is very large, single-node execution takes much longer and behaves unpredictably
Known Issues • When the frequency is very low for a large data set, the reducer takes more time • E.g.: a text paragraph in which words are not repeated often.
Future Work • The analysis can be repeated with Twister and other platforms • The algorithm can be extended to other applications that use machine learning techniques
References • http://en.wikipedia.org/wiki/Text_mining • http://en.wikipedia.org/wiki/Apriori_algorithm • http://hbase.apache.org/book/book.html • http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html • http://www.codeproject.com/KB/recipes/AprioriAlgorithm.aspx • http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf