COMP 6521 Advanced Database Systems and Applications

Project Presentation • Submitted to: Professor Gosta Grahane • Submitted by: Kulbir Sandhu, Priyanka Shukla • Group 16

Project - 1 External Sorting

Problem Statement • Build an application that sorts integer data in ascending order using external sorting. • Input: a data file containing a large number of integers. • Output: a file containing all the numbers sorted in ascending order. • Restrictions: • The language is restricted to Java. • The application should work with data of any size.

Our Approach • External sorting in two phases. • Phase I: • Read each sublist from the input file using a BufferedReader. • Create sorted sublists of 4 MB each using the quicksort algorithm (simple and efficient). • Phase II: • Create n input buffers, where n is the number of sorted sublists, merge them, and write the output to disk.
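Phase I can be sketched as follows. This is a minimal illustration rather than the project's actual code: the class name `RunGenerator`, the method names, the run file naming, and the one-integer-per-line input format are all assumptions. With 4-byte `int` values, a 4 MB sublist corresponds to `maxInts` of about one million. For `int[]`, `Arrays.sort` uses a dual-pivot quicksort.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits the input into sorted sublists (runs) on disk.
final class RunGenerator {

    /** Reads the input in chunks of at most maxInts integers and writes each chunk as a sorted run. */
    static List<Path> createSortedRuns(Path input, Path runDir, int maxInts) throws IOException {
        List<Path> runs = new ArrayList<>();
        int[] chunk = new int[maxInts]; // the only large in-memory array
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            int count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue; // assumption: blank lines are ignored
                chunk[count++] = Integer.parseInt(line);
                if (count == maxInts) { // chunk is full: sort it and flush it to disk
                    runs.add(writeRun(chunk, count, runDir, runs.size()));
                    count = 0;
                }
            }
            if (count > 0) { // last, partially filled chunk
                runs.add(writeRun(chunk, count, runDir, runs.size()));
            }
        }
        return runs;
    }

    private static Path writeRun(int[] chunk, int count, Path runDir, int index) throws IOException {
        Arrays.sort(chunk, 0, count); // sorts only the filled prefix
        Path run = runDir.resolve("run-" + index + ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(run)) {
            for (int i = 0; i < count; i++) {
                writer.write(Integer.toString(chunk[i]));
                writer.newLine();
            }
        }
        return run;
    }
}
```

A call such as `RunGenerator.createSortedRuns(input, runDir, 1 << 20)` would then produce the sorted sublists that Phase II merges.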

Our Approach (contd.) • Buffer the data from all sorted sublists, choose the minimum among the elements at the first position of each input buffer, and move it to the output buffer. • If an input buffer becomes empty, refill it from the corresponding sublist. If the sublist is exhausted, leave the buffer empty. • If the output buffer becomes full, append its contents to the final sorted output file and clear the output buffer.
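The merge phase described above can be sketched like this. It is an illustration under assumptions, not the project's code: `RunMerger` and its method names are hypothetical, each run file is assumed to hold one integer per line with no blank lines, and the `BufferedReader` and `BufferedWriter` objects play the role of the input and output buffers (they refill and flush themselves). The minimum is found by a linear scan over the heads of the runs. A `PriorityQueue` would be a common alternative when the number of runs is large.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: merges sorted run files into one sorted output file.
final class RunMerger {

    static void merge(List<Path> runs, Path output) throws IOException {
        int n = runs.size();
        BufferedReader[] readers = new BufferedReader[n]; // one input buffer per run
        int[] head = new int[n];                          // current first element of each run
        boolean[] hasHead = new boolean[n];               // false once a run is exhausted
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            try {
                for (int i = 0; i < n; i++) {
                    readers[i] = Files.newBufferedReader(runs.get(i));
                    hasHead[i] = advance(readers[i], head, i);
                }
                while (true) {
                    int min = -1; // index of the run holding the smallest head
                    for (int i = 0; i < n; i++) {
                        if (hasHead[i] && (min < 0 || head[i] < head[min])) {
                            min = i;
                        }
                    }
                    if (min < 0) break; // every run is exhausted
                    out.write(Integer.toString(head[min]));
                    out.newLine();
                    hasHead[min] = advance(readers[min], head, min);
                }
            } finally {
                for (BufferedReader r : readers) {
                    if (r != null) r.close();
                }
            }
        }
    }

    /** Loads the next integer of run i into head[i]; returns false at end of file. */
    private static boolean advance(BufferedReader reader, int[] head, int i) throws IOException {
        String line = reader.readLine();
        if (line == null) return false;
        head[i] = Integer.parseInt(line.trim());
        return true;
    }
}
```

Memory use is bounded by the number of runs times the reader buffer size, independent of the total input size.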

Pros • Fast, because arrays are used. • No loss of data. • Easy to implement.

Result • Running time by input file size: • Small: 2 seconds • Medium: 5 seconds • Large: 12 seconds

Project - 2 Mining Frequent Itemsets from Secondary Memory

Problem Statement • Build an application that computes frequent itemsets of all sizes (pairs, triples, quadruples, etc.) from a set of transactions. • Input: a data file containing integers, and a support threshold. • Output: a file containing each frequent itemset (pairs, triples, quadruples, etc.) with its support.

Restrictions • The language is restricted to Java. • One disk; main memory usage is limited to 5 MB. • The short data type must not be used.

Naive Approach • Using the triangular-matrix or triples method, we could easily have found the frequent pairs. • Problem: this approach does not work under our main memory restriction, because the pair counts must be held in main memory. In addition, the data file contains many transactions.
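To make the memory problem concrete, here is a minimal sketch of the triangular-matrix layout, assuming items are renumbered 0 to n-1 and each pair count is a 4-byte `int`. The class `TriangularMatrix` and the helper `maxItemsFor` are hypothetical names introduced for illustration. The pair {i, j} with i < j is stored at index i(2n - i - 1)/2 + (j - i - 1), so n items need n(n-1)/2 counters.

```java
// Hypothetical illustration of the triangular-matrix method for counting pairs.
final class TriangularMatrix {
    private final int n;        // number of distinct items, numbered 0 .. n-1
    private final int[] counts; // one counter per unordered pair

    TriangularMatrix(int n) {
        this.n = n;
        this.counts = new int[n * (n - 1) / 2]; // assumes n is small enough not to overflow int
    }

    /** Position of the unordered pair {i, j}, i != j, in the flat counter array. */
    int index(int i, int j) {
        if (i == j) throw new IllegalArgumentException("a pair needs two distinct items");
        if (i > j) { int t = i; i = j; j = t; } // normalise so that i < j
        return i * (2 * n - i - 1) / 2 + (j - i - 1);
    }

    void increment(int i, int j) { counts[index(i, j)]++; }

    int count(int i, int j) { return counts[index(i, j)]; }

    /** Largest n whose n(n-1)/2 int counters fit in budgetBytes, ignoring all other memory use. */
    static int maxItemsFor(long budgetBytes) {
        long slots = budgetBytes / Integer.BYTES;
        int n = 0;
        while ((long) (n + 1) * n / 2 <= slots) n++; // (n+1)n/2 = counters needed for n+1 items
        return n;
    }

    public static void main(String[] args) {
        System.out.println(maxItemsFor(5L * 1024 * 1024));
    }
}
```

Even if the whole 5 MB budget were spent on counters, this layout would cover only roughly 1,600 distinct items, which is why a method that prunes candidate pairs is needed.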

Algorithms Considered: • A-Priori Algorithm. • PCY Algorithm. • Multihash Algorithm.

Why PCY • Improves on A-Priori. • Uses memory efficiently. • Increases efficiency. • Compared with the alternatives: • A-Priori: the data must be scanned again and again. • FP-growth: uses more complex data structures and mining techniques, and requires extra memory. • Multihash: requires many disk reads, which decreases efficiency.

Brief Description • Pass I: • During Pass I of A-Priori, most memory is idle. • PCY uses that memory to keep counts of buckets into which pairs of items are hashed. • This gives an extra condition that candidate pairs must satisfy in Pass II. • Pass II: • Count all pairs {i, j} that meet both conditions: • Both i and j are frequent items. • The pair {i, j} hashes to a bucket whose bit in the bit vector is 1. • These conditions are necessary for the pair to have a chance of being frequent.
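The two passes can be sketched as follows for pairs only. This is an in-memory illustration, not the project's implementation: the class `PcySketch`, the hash function in `bucket`, and the `"i,j"` string keys are assumptions chosen for readability, and the baskets are passed as a list instead of being re-read from disk on each pass. Each basket is assumed to contain distinct items in the range 0 to `maxItem`.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of PCY for frequent pairs.
final class PcySketch {

    /** Returns the frequent pairs, keyed as "i,j" with i < j, mapped to their support counts. */
    static Map<String, Integer> frequentPairs(List<int[]> baskets, int maxItem,
                                              int numBuckets, int minSupport) {
        // Pass I: count single items and hash every pair to a bucket counter.
        int[] itemCount = new int[maxItem + 1];
        int[] bucketCount = new int[numBuckets];
        for (int[] basket : baskets) {
            for (int item : basket) itemCount[item]++;
            for (int a = 0; a < basket.length; a++) {
                for (int b = a + 1; b < basket.length; b++) {
                    bucketCount[bucket(basket[a], basket[b], numBuckets)]++;
                }
            }
        }

        // Between the passes: compress the bucket counts into one bit per bucket.
        BitSet frequentBucket = new BitSet(numBuckets);
        for (int k = 0; k < numBuckets; k++) {
            if (bucketCount[k] >= minSupport) frequentBucket.set(k);
        }
        bucketCount = null; // the counters are no longer needed in Pass II

        // Pass II: count only pairs of frequent items that hash to a frequent bucket.
        Map<String, Integer> pairCount = new HashMap<>();
        for (int[] basket : baskets) {
            for (int a = 0; a < basket.length; a++) {
                for (int b = a + 1; b < basket.length; b++) {
                    int i = Math.min(basket[a], basket[b]);
                    int j = Math.max(basket[a], basket[b]);
                    if (itemCount[i] >= minSupport && itemCount[j] >= minSupport
                            && frequentBucket.get(bucket(i, j, numBuckets))) {
                        pairCount.merge(i + "," + j, 1, Integer::sum);
                    }
                }
            }
        }
        // Candidates can be false positives (bucket collisions), so filter by true support.
        pairCount.values().removeIf(c -> c < minSupport);
        return pairCount;
    }

    /** Assumed hash function; order-independent because the pair is normalised first. */
    static int bucket(int i, int j, int numBuckets) {
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        return Math.floorMod(lo * 31 + hi, numBuckets);
    }
}
```

The string keys keep the sketch short. Under a tight memory budget, a more compact encoding of the candidate pairs would be the natural choice.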

Pros • Easy to implement. • Preferred whenever there is a main memory constraint.

Result • With the support threshold at 2%, our algorithm took around 220 seconds and produced frequent singles, pairs and triples. • Built-in Java hash functions such as MD5, MD4 and SHA-256 did not work with our memory restriction.


Thank you!
