COMP 6521 PROJECT External Sorting Algorithm  Mining Frequent Itemsets from Secondary Memory


  1. COMP 6521 PROJECT: External Sorting Algorithm; Mining Frequent Itemsets from Secondary Memory. By Team 1: Jun-Duo Chen, Ying Luo, Yichen Li

  2. Project 1: External Sorting Algorithm • Problem Statement • Design Principles • Implementation details • Optimization • Results COMP 6521 Advanced Database Systems and Applications

  3. Problem Statement • Goal: external sorting • Specifications: algorithm: two-phase multiway merge sort (2PMMS); language: Java; restriction: 5 MB of main memory

  4. Design Principles • Automatic integration of multi-pass Phase 2 • Modular design for areas intended for optimization

  5. Optimization (Wish List) • Buffer management: total buffer size; data type; number/size of buffers (2PMMS) • I/O module: selection of optimal reader/writer • Sorting algorithm: Quicksort, radix sort, etc.

  6. Implementation Details • Fixed total buffer size • int data type • Variable maximum size for each buffer • Fixed minimum buffer size • Single-pass Phase 2 for the large sample dataset • BufferedReader and BufferedWriter • Java's built-in Arrays.sort (Quicksort, according to the documentation)
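
The two phases listed above can be sketched roughly as follows. This is a minimal illustration, not the team's code: it assumes the input is a text file with one integer per line, and the class name `TwoPhaseSortSketch`, the `runSize` parameter, and the use of a `PriorityQueue` for the merge are all hypothetical choices. It also assumes the number of runs is small enough for a single-pass Phase 2.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class TwoPhaseSortSketch {
    /** Phase 1: cut the input into sorted runs of at most runSize integers each. */
    static List<Path> makeRuns(Path input, int runSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        int[] buf = new int[runSize];
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            int n = 0;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                buf[n++] = Integer.parseInt(line);
                if (n == runSize) { runs.add(writeRun(buf, n)); n = 0; }
            }
            if (n > 0) runs.add(writeRun(buf, n));
        }
        return runs;
    }

    /** Sorts the first n entries of buf in memory and writes them to a temporary run file. */
    static Path writeRun(int[] buf, int n) throws IOException {
        Arrays.sort(buf, 0, n);
        Path run = Files.createTempFile("run", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(run)) {
            for (int i = 0; i < n; i++) { out.write(Integer.toString(buf[i])); out.newLine(); }
        }
        return run;
    }

    /** Phase 2: merge all runs in one pass using a min-heap of {value, runIndex} entries. */
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int r = 0; r < runs.size(); r++) {
                BufferedReader reader = Files.newBufferedReader(runs.get(r));
                readers.add(reader);
                String line = reader.readLine();
                if (line != null) heap.add(new int[] { Integer.parseInt(line), r });
            }
            while (!heap.isEmpty()) {
                int[] top = heap.poll();
                out.write(Integer.toString(top[0]));
                out.newLine();
                // Refill the heap from the run that just supplied the smallest value.
                String next = readers.get(top[1]).readLine();
                if (next != null) heap.add(new int[] { Integer.parseInt(next), top[1] });
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
    }

    static void sort(Path input, Path output, int runSize) throws IOException {
        mergeRuns(makeRuns(input, runSize), output);
    }
}
```

In a setting like the one described, `runSize` would be derived from the fixed total buffer size, and a multi-pass Phase 2 would merge groups of runs into longer runs before the final merge.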

  7. Results • Execution time: 14-15 s

  8. Project II: Mining Frequent Itemsets from Secondary Memory • Problem Statement • Algorithms Considered • Chosen Algorithm and Motivation • Description of Algorithm • Design Principles • Result

  9. Problem Statement • Compute frequent itemsets of all sizes (pairs, triples, quadruples, etc.) given a file containing all transactions • Restriction: 5 MB of main memory and one disk for secondary storage

  10. Algorithms Considered • Apriori • Generates a large number of candidates, which strains the limited main memory • Requires multiple scans of the file to obtain candidates and counts, which is I/O-intensive

  11. Algorithms Considered • PCY • Designing the hash function is non-trivial • Potential I/O degradation due to increased processing time between transactions

  12. Algorithms Considered • FP-tree • The tree nodes consume a lot of memory • Better performance for large data files, because the data is compressed • Requires enough main memory to hold the compressed data

  13. Chosen Algorithm and Motivation • Improved Apriori algorithm • A triangular matrix is used to save memory

  14. Description of Algorithm • A customized algorithm for each of the first three passes, to reduce memory consumption • A uniform algorithm for any remaining passes

  15. Description of Algorithm • Pass one: generate frequent items • Read and parse the input file • Update the candidate item list and counts • When an item's count reaches the support threshold, add it to the frequent-item list • Output: frequent items and counts
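
Pass one can be sketched as below. The slides do not specify the transaction file format, so this assumes one transaction per line with whitespace-separated integer item IDs and no repeated item within a line. The class and method names are hypothetical. Unlike the slide, which adds an item to the frequent list as soon as it reaches the threshold, this version filters once at the end of the scan; the resulting list is the same.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class PassOneSketch {
    /** Returns item -> count for every item whose count is at least minSupport. */
    static Map<Integer, Integer> frequentItems(BufferedReader in, int minSupport) throws IOException {
        Map<Integer, Integer> counts = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            // Each token is assumed to be one integer item ID.
            for (String token : line.split("\\s+")) {
                counts.merge(Integer.parseInt(token), 1, Integer::sum);
            }
        }
        // Keep only the items that reached the support threshold, ordered by item ID.
        Map<Integer, Integer> frequent = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) frequent.put(e.getKey(), e.getValue());
        }
        return frequent;
    }
}
```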

  16. Description of Algorithm • Pass two: generate frequent pairs (A-Priori algorithm) • Read and parse the input file • Double loop over the frequent items to generate candidate pairs • Store the candidate pair counts in a triangular matrix • $k = (i - 1)\left(n - \frac{i}{2}\right) + j - i$, where $1 \le i < j \le n$ and $k$ is the index of pair $\{i, j\}$ in the count array • When a pair's count reaches the support threshold, add it to the frequent-pair list • Output: frequent-pair list and counts
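
A small sketch of the triangular-matrix layout follows. It uses the standard 1-based formula k = (i − 1)(n − i/2) + j − i for 1 ≤ i < j ≤ n, rewritten with integer arithmetic as (i − 1)(2n − i)/2 + j − i, and subtracts 1 to get a 0-based Java array slot. The class name and methods are hypothetical, and the sketch assumes the frequent items have been renumbered 1..n.

```java
class TriangularMatrixSketch {
    private final int n;        // number of frequent items, renumbered 1..n
    private final int[] counts; // one slot per unordered pair: n(n-1)/2 in total

    TriangularMatrixSketch(int n) {
        this.n = n;
        this.counts = new int[n * (n - 1) / 2];
    }

    /** 0-based array slot for the pair {i, j}; the arguments may come in either order. */
    int index(int i, int j) {
        if (i == j || i < 1 || j < 1 || i > n || j > n) {
            throw new IllegalArgumentException("need two distinct items in 1.." + n);
        }
        if (i > j) { int t = i; i = j; j = t; }
        // (i - 1)(2n - i) is always even, so the integer division is exact.
        int k = (i - 1) * (2 * n - i) / 2 + j - i; // 1-based position
        return k - 1;
    }

    void increment(int i, int j) { counts[index(i, j)]++; }

    int get(int i, int j) { return counts[index(i, j)]; }
}
```

Compared with an n-by-n int matrix, this layout stores each pair once, which roughly halves the memory needed for the pair counts.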

  17. Description of Algorithm • Pass three: generate frequent triples (A-Priori algorithm) • Store the frequent pairs in a HashSet • Read and parse each line of the input file • Double loop over the items of each transaction, checking each pair against the HashSet to find candidate pairs • Generate candidate triples from the candidate pairs and count them • When a triple's count reaches the support threshold, add it to the frequent-triple list • Output: frequent-triple list and counts
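
One way the counting step for a single transaction might look is sketched here. The slides do not say how candidate triples are derived from candidate pairs, so this assumes the usual A-Priori condition: a triple is counted only if all three of its pairs are frequent. Encoding a pair as a single `long` key, and the names `PassThreeSketch`, `pairKey`, and `countTriples`, are hypothetical choices.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PassThreeSketch {
    /** Packs an unordered pair of item IDs into one long so it can live in a HashSet. */
    static long pairKey(int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xffffffffL);
    }

    /** Adds 1 to the count of every triple in the transaction whose three pairs are all frequent. */
    static void countTriples(int[] transaction, Set<Long> frequentPairs,
                             Map<List<Integer>, Integer> counts) {
        int[] t = transaction.clone();
        Arrays.sort(t); // canonical order, so each triple has exactly one key
        for (int x = 0; x < t.length; x++) {
            for (int y = x + 1; y < t.length; y++) {
                // Skip early: no triple can contain an infrequent pair.
                if (!frequentPairs.contains(pairKey(t[x], t[y]))) continue;
                for (int z = y + 1; z < t.length; z++) {
                    if (frequentPairs.contains(pairKey(t[x], t[z]))
                            && frequentPairs.contains(pairKey(t[y], t[z]))) {
                        counts.merge(Arrays.asList(t[x], t[y], t[z]), 1, Integer::sum);
                    }
                }
            }
        }
    }
}
```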

  18. Description of Algorithm • Remaining passes: generate frequent itemsets (A-Priori algorithm) • Read and parse the input file • Generate candidates from the output of the previous pass and count each candidate set • When a set's count reaches the support threshold, add it to the frequent-set list • Output: frequent-set list and counts
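
The slides say only that candidates are generated from the previous pass's output. A sketch of the textbook A-Priori join-and-prune step, which may differ from what the team implemented, is shown below. It assumes each frequent (k−1)-itemset is stored as a sorted `List<Integer>`; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CandidateGenSketch {
    /** Builds candidate k-itemsets from the frequent (k-1)-itemsets (each a sorted list). */
    static List<List<Integer>> generate(Set<List<Integer>> prev) {
        List<List<Integer>> all = new ArrayList<>(prev);
        List<List<Integer>> candidates = new ArrayList<>();
        for (List<Integer> a : all) {
            for (List<Integer> b : all) {
                int m = a.size();
                if (b.size() != m) continue;
                // Join step: same first m-1 items, and a's last item smaller than b's.
                if (!a.subList(0, m - 1).equals(b.subList(0, m - 1))) continue;
                if (a.get(m - 1) >= b.get(m - 1)) continue;
                List<Integer> candidate = new ArrayList<>(a);
                candidate.add(b.get(m - 1));
                // Prune step: every (k-1)-subset must itself be frequent.
                if (allSubsetsFrequent(candidate, prev)) candidates.add(candidate);
            }
        }
        return candidates;
    }

    static boolean allSubsetsFrequent(List<Integer> candidate, Set<List<Integer>> prev) {
        for (int skip = 0; skip < candidate.size(); skip++) {
            List<Integer> subset = new ArrayList<>(candidate);
            subset.remove(skip); // removes by index, since skip is an int
            if (!prev.contains(subset)) return false;
        }
        return true;
    }
}
```

The pairwise join is quadratic in the number of frequent (k−1)-itemsets, which is acceptable only because just the previous pass's output is held in memory.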

  19. Design Principles • Data structures: Item, Pair, Triple and FrequentItemSet; ArrayList; int[] • Memory management: only the output of the previous pass is stored in main memory • I/O design: BufferedReader and BufferedWriter

  20. Demo Result • Program execution time: 121 seconds

  21. Thank you!