Professor: Dr. Gosta Grahne
Lab Instructor: Ashkan Azarnik
Group 5: Deyvid William Romeo Honvo Venkatesh S R
Advanced Database Systems and Applications, COMP 6521
Contents
• Project 1: External Sorting Algorithm, 2PMMS Implementation
• Project 2: Mining Frequent Itemsets from Secondary Memory
  Part 1: Problem Analysis & Algorithm Consideration
  Part 2: Algorithm Description & Design Principles
PROJECT 1
• Develop a program that sorts numbers in ascending order using Two-Phase Multiway Merge-Sort (2PMMS), with virtual memory limited to 5 MB.
• External sorting is required when the data being sorted does not fit into the main memory of a computing device and must instead reside in slower external memory (usually a hard drive).
Two-Phase Multiway Merge-Sort (2PMMS) Solution
[Diagram: Unsorted File → Phase 1 → Sorted Runs → Phase 2 → Sorted File]
Approach to the problem
• In the 1st phase, chunks of data that fit in main memory are read, sorted using the built-in sort function from the Arrays class (Java), and written out to temporary files.
• In the 2nd phase (merging), the sorted temporary files are combined by a multiway merge into a single larger sorted file.
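The merging phase can be sketched as a k-way merge driven by a priority queue. This is a minimal illustration, not the group's actual code: the class name RunMerger, the one-integer-per-line file format, and the use of PriorityQueue are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: merges sorted run files into one sorted file.
public class RunMerger {
    /** One open run plus the value currently at its head. */
    private static final class RunCursor implements Comparable<RunCursor> {
        final BufferedReader reader;
        int head;

        RunCursor(BufferedReader reader) { this.reader = reader; }

        /** Loads the next number; returns false when the run is exhausted. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) return false;
            head = Integer.parseInt(line.trim());
            return true;
        }

        @Override
        public int compareTo(RunCursor other) {
            return Integer.compare(head, other.head);
        }
    }

    /** Assumes every run file holds one integer per line, already sorted ascending. */
    public static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<RunCursor> heap = new PriorityQueue<>();
        List<BufferedReader> opened = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            // Seed the heap with the first value of every non-empty run.
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                opened.add(reader);
                RunCursor cursor = new RunCursor(reader);
                if (cursor.advance()) heap.add(cursor);
            }
            // Repeatedly emit the smallest head and refill from the same run.
            while (!heap.isEmpty()) {
                RunCursor smallest = heap.poll();
                out.write(Integer.toString(smallest.head));
                out.newLine();
                if (smallest.advance()) heap.add(smallest);
            }
        } finally {
            for (BufferedReader reader : opened) reader.close();
        }
    }
}
```

Only one value per run is held in the heap at a time, so memory use in this sketch depends on the number of runs (plus the reader buffers), not on the total amount of data.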
Challenges Faced
Which algorithm to choose?
• After a few tests, we decided to use the built-in sort function from Java, which implements a tuned quicksort algorithm.
• This algorithm offers n*log(n) performance on many data sets that cause other quicksorts to degrade to quadratic performance.
• Efficient average case compared to other sort algorithms.
• A buffer of size 750,000 was used for the 1st phase.
• newBufferedReader (from the Files class, Java 7) was used to read files.
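The first phase, with a fixed-size buffer, newBufferedReader and the built-in sort, might look roughly like the sketch below. The class name RunGenerator, the run file names, and the one-integer-per-line input format are assumptions for illustration; the buffer size is a parameter here rather than the 750,000 used in the project.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits an input file into sorted run files.
public class RunGenerator {
    /** Assumes one integer per line; each run holds at most bufferSize numbers. */
    public static List<Path> createRuns(Path input, Path tempDir, int bufferSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        int[] buffer = new int[bufferSize]; // reused for every chunk
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                buffer[count++] = Integer.parseInt(line.trim());
                if (count == bufferSize) { // buffer full: flush a sorted run
                    runs.add(writeRun(buffer, count, tempDir, runs.size()));
                    count = 0;
                }
            }
            if (count > 0) { // last, partially filled chunk
                runs.add(writeRun(buffer, count, tempDir, runs.size()));
            }
        }
        return runs;
    }

    private static Path writeRun(int[] buffer, int count, Path tempDir, int index) throws IOException {
        Arrays.sort(buffer, 0, count); // sort only the filled prefix of the buffer
        Path run = tempDir.resolve("run-" + index + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (int i = 0; i < count; i++) {
                out.write(Integer.toString(buffer[i]));
                out.newLine();
            }
        }
        return run;
    }
}
```

Reusing a single primitive int array avoids per-number object allocation, which matters when the memory budget is tight.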
List of Data Structures
• Primitive Types: Boolean, Integer, Long
• Abstract Types: Array, String
• Arrays (Linear Data Structure): Integer Array, Boolean Array, Long Array
• I/O: newBufferedReader
Project 2
Mining Frequent Itemsets from Secondary Memory
Develop an application that computes the frequent itemsets of all sizes (pairs, triples, quadruples, etc.) from a set of transactions, based on an input support threshold percentage.
Algorithms Considered
FP-Growth vs Eclat
Eclat uses a purely vertical representation, whereas FP-Growth combines both vertical and horizontal representations in its FP-tree structure.
FP-Growth takes a lot of memory and is more difficult to implement than Eclat.
ECLAT
Better execution time
Memory efficient
Basic algorithm
Very good for dense datasets
Requires less memory than FP-Growth
Map of Bitsets
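To illustrate the "Map of Bitsets" idea, here is a small in-memory Eclat sketch: each item maps to a BitSet of the transaction ids that contain it, and the support of an itemset is the cardinality of the intersection of its items' bitsets. The class name EclatSketch, the integer item encoding, and the string output format are assumptions for the example; the project's actual code is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory Eclat sketch over a map of bitsets.
public class EclatSketch {
    /** Builds the vertical layout: item -> bitset of transaction ids containing it. */
    public static Map<Integer, BitSet> toVertical(List<int[]> transactions) {
        Map<Integer, BitSet> tidsets = new TreeMap<>();
        for (int tid = 0; tid < transactions.size(); tid++) {
            for (int item : transactions.get(tid)) {
                BitSet bits = tidsets.get(item);
                if (bits == null) {
                    bits = new BitSet();
                    tidsets.put(item, bits);
                }
                bits.set(tid);
            }
        }
        return tidsets;
    }

    /** Returns every itemset with support >= minSupport, rendered like "[1, 3]=3". */
    public static List<String> mine(Map<Integer, BitSet> tidsets, int minSupport) {
        List<Integer> items = new ArrayList<>();
        List<BitSet> bits = new ArrayList<>();
        for (Map.Entry<Integer, BitSet> e : tidsets.entrySet()) {
            if (e.getValue().cardinality() >= minSupport) { // keep frequent items only
                items.add(e.getKey());
                bits.add(e.getValue());
            }
        }
        List<String> result = new ArrayList<>();
        extend(new ArrayList<Integer>(), null, items, bits, 0, minSupport, result);
        return result;
    }

    /** Depth-first extension of the current prefix with each greater item. */
    private static void extend(List<Integer> prefix, BitSet prefixTids, List<Integer> items,
                               List<BitSet> bits, int start, int minSupport, List<String> out) {
        for (int i = start; i < items.size(); i++) {
            BitSet tids = (BitSet) bits.get(i).clone();
            if (prefixTids != null) tids.and(prefixTids); // intersect transaction-id sets
            int support = tids.cardinality();
            if (support < minSupport) continue;           // prune infrequent extensions
            prefix.add(items.get(i));
            out.add(prefix + "=" + support);
            extend(prefix, tids, items, bits, i + 1, minSupport, out);
            prefix.remove(prefix.size() - 1);
        }
    }
}
```

Bitset intersection is a word-wise AND, which is one reason the vertical layout tends to work well on dense datasets.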
ECLAT Implementation
List of Data Structures
• Primitive Types
  • Boolean, Integer, Double
• Abstract Types
  • Hash Map
  • String
• Arrays
  • Array List (Dynamic)
  • Bit Set (Bit Array)
  • String Array
ECLAT Implementation
1. Scan the original file and find the frequent items.
2. Generate n partitions (files) that contain groups of frequent items.
3. Read every file, register items/transactions, then find the frequent itemsets and write them to the output file.
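Step 1 could be sketched as a single pass that counts items and then applies the support threshold. The class name FrequentItemScan and the input format (one transaction per line, items separated by whitespace, each item at most once per transaction) are assumptions for illustration, as is rounding the minimum count up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for step 1: one scan of the transaction file.
public class FrequentItemScan {
    /**
     * Returns item -> count for every item whose count reaches the threshold.
     * Assumes each item appears at most once per transaction line.
     */
    public static Map<String, Integer> frequentItems(Path input, double supportPercent) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        int transactions = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue; // skip blank lines
                transactions++;
                for (String item : line.split("\\s+")) {
                    Integer old = counts.get(item);
                    counts.put(item, old == null ? 1 : old + 1);
                }
            }
        }
        // Convert the percentage into an absolute minimum count (rounded up).
        int minCount = (int) Math.ceil(supportPercent / 100.0 * transactions);
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) frequent.put(e.getKey(), e.getValue());
        }
        return frequent;
    }
}
```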
ECLAT Implementation
Divide-and-conquer approach
Algorithm based on the concepts of Diskmine and projection described in the professor's paper "Mining Frequent Itemsets from Secondary Memory"
The large database is decomposed into a number of small databases to be processed
Each small database contains a percentage of the frequent items and all greater items in the same transaction
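One possible reading of this decomposition is sketched below; it is an interpretation for illustration, not the Diskmine algorithm itself nor the group's code. It assumes frequent items have been replaced by integer ranks 0..numItems-1, that each transaction is sorted by rank, and that consecutive ranks are grouped into partitions of groupSize items. In-memory lists stand in for the partition files, and the name ProjectionSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of projecting a database into smaller databases.
public class ProjectionSketch {
    /**
     * For every group of consecutive item ranks, collects the suffix of each
     * transaction that starts at the transaction's first item in that group,
     * i.e. the group's items plus all greater items in the same transaction.
     */
    public static List<List<int[]>> project(List<int[]> transactions, int numItems, int groupSize) {
        int numGroups = (numItems + groupSize - 1) / groupSize;
        List<List<int[]>> partitions = new ArrayList<>();
        for (int g = 0; g < numGroups; g++) {
            partitions.add(new ArrayList<int[]>());
        }
        for (int[] t : transactions) {
            int lastGroup = -1;
            for (int pos = 0; pos < t.length; pos++) {
                int group = t[pos] / groupSize;
                if (group != lastGroup) {
                    // First item of this group in t: keep it and everything after it.
                    partitions.get(group).add(Arrays.copyOfRange(t, pos, t.length));
                    lastGroup = group;
                }
            }
        }
        return partitions;
    }
}
```

Under these assumptions, every itemset whose smallest item falls in a given group can be counted from that group's partition alone, so the partitions can be mined one at a time.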
ECLAT Implementation Improved
1. Scan the original file and find the frequent items.
2. Generate n partitions that contain groups of frequent items, grouped based on their frequency.
3. Read every file, register items/transactions, then find the frequent itemsets and write them to the output file.
Thanks! Merci!