COMP 6521 Project
External Sorting Algorithm
Mining Frequent Itemsets from Secondary Memory
By Team 1: Jun-Duo Chen, Ying Luo, Yichen Li
Project 1: External Sorting Algorithm
• Problem Statement
• Design Principles
• Implementation Details
• Optimization
• Results
COMP 6521 Advanced Database Systems and Applications
Problem Statement
• Goal: external sorting
• Specifications
  • Algorithm: 2PMMS (two-phase multiway merge sort)
  • Language: Java
  • Restriction: 5 MB of main memory
Design Principles
• Automatic switch to a multi-pass Phase 2 when needed
• Modular design for the areas intended for optimization
Optimization (Wish List)
• Buffer management
  • Total buffer size
  • Data type
  • Number/size of buffers (2PMMS)
• I/O module
  • Selection of the optimal reader/writer
• Sorting algorithm
  • Quicksort, radix sort, etc.
Implementation Details
• Fixed total buffer size
• int data type
• Variable maximum size for each buffer
• Fixed minimum buffer size
• Single-pass Phase 2 for the large sample dataset
• BufferedReader and BufferedWriter
• Java's built-in Arrays.sort (Quicksort, according to the documentation)
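A minimal sketch of a single-pass Phase 2 merge, not the team's actual code. It assumes Phase 1 has already written sorted runs with one integer per line; the names MergeSketch, Head and merge are hypothetical. Each run is read through a BufferedReader and the smallest head value is repeatedly written to a BufferedWriter.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of already-sorted runs (Phase 2 of 2PMMS).
class MergeSketch {
    // Current head value of one run, plus the reader it was read from.
    private static final class Head {
        final int value;
        final BufferedReader source;
        Head(int value, BufferedReader source) {
            this.value = value;
            this.source = source;
        }
    }

    // Assumption: every run holds one integer per line, in ascending order.
    static void merge(List<BufferedReader> runs, BufferedWriter out) {
        try {
            PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
            // Prime the heap with the first value of each non-empty run.
            for (BufferedReader run : runs) {
                String line = run.readLine();
                if (line != null) {
                    heap.add(new Head(Integer.parseInt(line.trim()), run));
                }
            }
            // Emit the smallest head, then refill from the run it came from.
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                out.write(Integer.toString(smallest.value));
                out.newLine();
                String next = smallest.source.readLine();
                if (next != null) {
                    heap.add(new Head(Integer.parseInt(next.trim()), smallest.source));
                }
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a fixed total buffer size, the number of runs that can be merged in one pass is bounded by how many input buffers fit in memory; beyond that bound a multi-pass Phase 2 is needed.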
Results
• Execution time: 14-15 s
Project 2: Mining Frequent Itemsets from Secondary Memory
• Problem Statement
• Algorithms Considered
• Chosen Algorithm and Motivation
• Description of Algorithm
• Design Principles
• Result
Problem Statement
• Compute frequent itemsets of all sizes (pairs, triples, quadruples, etc.) given a file containing all transactions
• Restriction: 5 MB of main memory and one disk for secondary storage
Algorithms Considered
• Apriori
  • Generates a large number of candidates, which is hard to hold in limited main memory
  • Needs multiple scans of the file to obtain candidates and counts, which is I/O intensive
Algorithms Considered
• PCY
  • Designing the hash function is non-trivial
  • Potential I/O degradation due to increased processing time between transactions
Algorithms Considered
• FP Tree
  • The tree nodes consume a lot of memory
  • Better performance for large data files, since the data is compressed
  • Requires enough main memory to hold the compressed data
Chosen Algorithm and Motivation
• Improved Apriori algorithm
• A triangular matrix is used to save memory
Description of Algorithm
• Customized algorithm in the first three passes to reduce memory consumption
• Uniform algorithm for any remaining passes
Description of Algorithm
• Pass one: generate frequent items
  • Read and parse the input file
  • Update the candidate item list and counts
  • When an item reaches the support threshold, add it to the frequent-item list
  • Output: frequent items and counts
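Pass one can be sketched as below. This is an illustration under assumptions, not the team's code: the input is assumed to hold one transaction per line with whitespace-separated integer item IDs, each item appearing at most once per transaction, and the names PassOneSketch and frequentItems are hypothetical. For brevity it filters by the support threshold after the scan rather than during it.

```java
import java.io.BufferedReader;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of pass one: count single items, keep the frequent ones.
class PassOneSketch {
    // Assumption: one transaction per line, items are whitespace-separated ints,
    // and no item is repeated inside a transaction.
    static Map<Integer, Integer> frequentItems(BufferedReader in, int minSupport) {
        Map<Integer, Integer> counts = new HashMap<>();
        in.lines().forEach(line -> {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(Integer.parseInt(token), 1, Integer::sum);
                }
            }
        });
        // Keep only items whose count reaches the support threshold.
        Map<Integer, Integer> frequent = new TreeMap<>();
        counts.forEach((item, count) -> {
            if (count >= minSupport) {
                frequent.put(item, count);
            }
        });
        return frequent;
    }
}
```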
Description of Algorithm
• Pass two: generate frequent pairs, A-Priori algorithm
  • Read and parse the input file
  • Double loop over the frequent items to generate candidate pairs
  • Store the candidate pair counts in a triangular matrix
    • k = (i - 1)(n - i/2) + j - i, where k is the index into the count array for the pair {i, j}, 1 ≤ i < j ≤ n
  • When a pair reaches the support threshold, add it to the frequent-pair list
  • Output: frequent-pair list and counts
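A small sketch of the triangular-matrix count array, assuming the standard layout k = (i - 1)(n - i/2) + j - i with frequent items renumbered 1..n and 1 ≤ i < j ≤ n. The class and method names are hypothetical. The product (i - 1)(2n - i) is always even, so the integer division below is exact.

```java
// Hypothetical sketch: pair counts stored in a one-dimensional triangular array.
class TriangularCounts {
    private final int n;         // number of frequent items, renumbered 1..n
    private final int[] counts;  // one slot per unordered pair: n(n-1)/2 in total

    TriangularCounts(int n) {
        this.n = n;
        this.counts = new int[n * (n - 1) / 2];
    }

    // 1-based position k of the pair {i, j}, requiring 1 <= i < j <= n.
    // Same as (i - 1)(n - i/2) + j - i, written to stay in integer arithmetic.
    static int index(int i, int j, int n) {
        return (i - 1) * (2 * n - i) / 2 + j - i;
    }

    // Increment the count of the pair {a, b}, in either argument order.
    void increment(int a, int b) {
        counts[index(Math.min(a, b), Math.max(a, b), n) - 1]++;
    }

    int get(int a, int b) {
        return counts[index(Math.min(a, b), Math.max(a, b), n) - 1];
    }
}
```

Compared with storing each pair as an object or a triple (i, j, count), this keeps one int per possible pair, which is what makes it attractive under a 5 MB limit when most pairs of frequent items actually occur.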
Description of Algorithm
• Pass three: generate frequent triples, A-Priori algorithm
  • Store the frequent pairs in a HashSet
  • Read and parse each line from the input file
  • Double loop over the items; check each pair against the HashSet to generate candidate pairs
  • Generate candidate triples from the candidate pairs and count them
  • When a triple reaches the support threshold, add it to the frequent-triple list
  • Output: frequent-triple list and counts
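One way to sketch the per-transaction work of pass three, under assumptions: a pair is encoded as a single long key for the HashSet, and a triple is counted only when all three of its pairs are frequent. The names PassThreeSketch, pairKey and countTriples and the key encoding are hypothetical, and the team's own Pair and Triple classes may work differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of pass three: count triples whose pairs are all frequent.
class PassThreeSketch {
    // Encode an unordered pair of item IDs as one long (smaller ID in the high bits).
    static long pairKey(int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xFFFFFFFFL);
    }

    // Update tripleCounts with every candidate triple found in one transaction.
    static void countTriples(int[] transaction, Set<Long> frequentPairs,
                             Map<List<Integer>, Integer> tripleCounts) {
        int[] items = transaction.clone();
        Arrays.sort(items); // canonical order, so each triple has a single key
        for (int x = 0; x < items.length; x++) {
            for (int y = x + 1; y < items.length; y++) {
                // Double loop: skip pairs that are not frequent.
                if (!frequentPairs.contains(pairKey(items[x], items[y]))) {
                    continue;
                }
                for (int z = y + 1; z < items.length; z++) {
                    if (frequentPairs.contains(pairKey(items[x], items[z]))
                            && frequentPairs.contains(pairKey(items[y], items[z]))) {
                        tripleCounts.merge(
                            Arrays.asList(items[x], items[y], items[z]), 1, Integer::sum);
                    }
                }
            }
        }
    }
}
```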
Description of Algorithm
• Remaining passes: generate frequent itemsets, A-Priori algorithm
  • Read and parse the input file
  • Generate candidates from the output of the previous pass and count them
  • When an itemset reaches the support threshold, add it to the frequent-set list
  • Output: frequent-set list and counts
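Candidate generation from the previous pass can be illustrated with the textbook Apriori join-and-prune step. The slides do not say that the team implemented it exactly this way, so treat this as a sketch; CandidateGenSketch and its methods are hypothetical names, and each itemset is assumed to be a sorted list of item IDs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: build size-(k+1) candidates from frequent size-k itemsets.
class CandidateGenSketch {
    // Assumption: every itemset in frequentPrev is sorted and has the same size k.
    static List<List<Integer>> generate(List<List<Integer>> frequentPrev) {
        Set<List<Integer>> known = new HashSet<>(frequentPrev);
        List<List<Integer>> candidates = new ArrayList<>();
        for (int a = 0; a < frequentPrev.size(); a++) {
            for (int b = a + 1; b < frequentPrev.size(); b++) {
                List<Integer> p = frequentPrev.get(a);
                List<Integer> q = frequentPrev.get(b);
                int k = p.size();
                // Join step: the two itemsets must share their first k-1 items.
                if (!p.subList(0, k - 1).equals(q.subList(0, k - 1))) {
                    continue;
                }
                int pLast = p.get(k - 1);
                int qLast = q.get(k - 1);
                if (pLast == qLast) {
                    continue;
                }
                List<Integer> joined = new ArrayList<>(p.subList(0, k - 1));
                joined.add(Math.min(pLast, qLast));
                joined.add(Math.max(pLast, qLast));
                // Prune step: every size-k subset must itself be frequent.
                if (allSubsetsFrequent(joined, known)) {
                    candidates.add(joined);
                }
            }
        }
        return candidates;
    }

    private static boolean allSubsetsFrequent(List<Integer> itemset,
                                              Set<List<Integer>> known) {
        for (int drop = 0; drop < itemset.size(); drop++) {
            List<Integer> subset = new ArrayList<>(itemset);
            subset.remove(drop); // removes by index, not by value
            if (!known.contains(subset)) {
                return false;
            }
        }
        return true;
    }
}
```

Because only the frequent itemsets of the previous pass are needed to build the candidates, this fits the memory-management principle of keeping only the previous pass's output in main memory.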
Design Principles
• Data structures:
  • Item, Pair, Triple and FrequentItemSet
  • ArrayList
  • int[]
• Memory management:
  • Only the output of the previous pass is kept in main memory
• I/O design:
  • BufferedReader and BufferedWriter
Demo Result
• Program execution time: 121 seconds
Thank you!