
COMP – 6521 Advance database SYSTEMS and applications

COMP – 6521 Advance database SYSTEMS and applications. Project Presentation. Submitted to: Professor Gosta Grahane. Submitted by: Kulbir Sandhu, Priyanka Shukla.


Presentation Transcript


  1. COMP – 6521 Advance database SYSTEMS and applications • Project Presentation • Submitted to: Professor Gosta Grahane • Submitted by: Kulbir Sandhu, Priyanka Shukla • Group 16

  2. Project 1: External Sorting

  3. Problem Statement • Build an application that sorts integer data in ascending order using external sorting. • Input: a data file containing a large number of integers. • Output: a file containing all the numbers sorted in ascending order. • Restrictions: • The language is restricted to Java. • The application should work with data of any size.

  4. Our Approach • Use external sorting. • Phase I: • Read the input one sublist at a time with a buffered reader. • Sort each 4 MB sublist using the quicksort algorithm (simple and efficient) and write it to disk. • Phase II: • Create n input buffers, where n is the number of sorted sublists, merge them, and write the output to disk.
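
Phase I can be sketched as follows. This is a minimal illustration rather than the group's actual code: the class name `RunGenerator`, the one-integer-per-line input format, and the temporary-file naming are all assumptions. `Arrays.sort` on an `int[]` range stands in for the quicksort step (the JDK uses a dual-pivot quicksort for primitive arrays). A 4 MB sublist of 4-byte ints would correspond to `maxInts = 1_048_576`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RunGenerator {
    /** Reads one integer per line and writes sorted sublists of at most maxInts values each. */
    static List<Path> createSortedRuns(BufferedReader in, int maxInts) {
        List<Path> runs = new ArrayList<>();
        int[] chunk = new int[maxInts];          // the only large in-memory array
        try {
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;    // tolerate blank lines
                chunk[count++] = Integer.parseInt(line);
                if (count == maxInts) {          // chunk is full: sort it and spill to disk
                    runs.add(writeRun(chunk, count));
                    count = 0;
                }
            }
            if (count > 0) runs.add(writeRun(chunk, count)); // last, partial sublist
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    private static Path writeRun(int[] chunk, int count) throws IOException {
        Arrays.sort(chunk, 0, count);            // sorts only the filled prefix
        Path run = Files.createTempFile("run", ".txt"); // hypothetical naming
        try (BufferedWriter out = Files.newBufferedWriter(run)) {
            for (int i = 0; i < count; i++) {
                out.write(Integer.toString(chunk[i]));
                out.write('\n');
            }
        }
        return run;
    }

    /** Helper for inspecting a sublist file; not part of the sorting itself. */
    static List<Integer> readRun(Path run) {
        try {
            List<Integer> values = new ArrayList<>();
            for (String line : Files.readAllLines(run)) values.add(Integer.valueOf(line));
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader("5\n3\n9\n1\n7\n"));
        for (Path run : createSortedRuns(in, 2)) System.out.println(readRun(run));
    }
}
```

Reusing one `int[]` for every chunk keeps peak memory close to the chosen sublist size, which is the point of splitting the input in the first place.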

  5. Our Approach (contd.) • Buffer data from every sorted sublist, choose the minimum among the elements at the first position of each input buffer, and move it to the output buffer. • If an input buffer becomes empty, refill it from the corresponding sublist. If the sublist is exhausted, leave the buffer empty. • If the output buffer becomes full, append its contents to the final sorted output file and clear the output buffer.
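
The merge phase described above can be sketched like this. It is an illustration under assumptions, not the presented implementation: the class name `RunMerger` is hypothetical, each sublist is assumed to hold one integer per line with no blank lines, and `BufferedReader` / `BufferedWriter` play the roles of the input and output buffers (they refill and flush themselves). The minimum is found by a linear scan over the current head of each sublist.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

class RunMerger {
    /** Merges already-sorted sublists into one ascending stream written to sink. */
    static void merge(List<BufferedReader> runs, Writer sink) {
        try {
            BufferedWriter out = new BufferedWriter(sink);   // output buffer
            int n = runs.size();
            Integer[] heads = new Integer[n];                // null means "sublist exhausted"
            for (int i = 0; i < n; i++) heads[i] = next(runs.get(i));

            while (true) {
                int min = -1;                                // index of the smallest head
                for (int i = 0; i < n; i++) {
                    if (heads[i] != null && (min < 0 || heads[i] < heads[min])) min = i;
                }
                if (min < 0) break;                          // every sublist is exhausted
                out.write(Integer.toString(heads[min]));
                out.write('\n');
                heads[min] = next(runs.get(min));            // advance only the winning sublist
            }
            out.flush();                                     // push the last partial buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Integer next(BufferedReader run) throws IOException {
        String line = run.readLine();
        return line == null ? null : Integer.valueOf(line.trim());
    }

    public static void main(String[] args) {
        List<BufferedReader> runs = Arrays.asList(
                new BufferedReader(new StringReader("3\n5\n")),
                new BufferedReader(new StringReader("1\n9\n")),
                new BufferedReader(new StringReader("7\n")));
        StringWriter result = new StringWriter();
        merge(runs, result);
        System.out.print(result);
    }
}
```

With many sublists, a min-heap over the heads would replace the linear scan; the scan is kept here because it mirrors the description most directly.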

  6. Pros • Fast, because arrays are used. • No loss of data. • Easy to implement.

  7. Result • Small input: 2 seconds • Medium input: 5 seconds • Large input: 12 seconds

  8. Project 2: Mining Frequent Itemsets from Secondary Memory

  9. Problem Statement • Build an application that computes frequent itemsets of all sizes (pairs, triples, quadruples, etc.) from a set of transactions. • Input: a data file containing integers, and a support threshold. • Output: a file containing each frequent itemset (pair, triple, quadruple, etc.) with its support.

  10. Restrictions • The language is restricted to Java. • One disk; main memory usage is limited to 5 MB. • The short data type must not be used.

  11. Naive Approach • Using the triangular-matrix or triples scheme, we could easily have found the frequent pairs. • Problem: this approach does not work because of the main memory restriction, and the data file contains many transactions.
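
To see why the triangular-matrix scheme runs out of room, the sketch below stores the count for each pair {i, j} with 0 ≤ i < j < n in a flat array of n(n−1)/2 ints. The class and method names are hypothetical, and items are assumed to be renumbered 0..n−1. With 4-byte counters, a 5 MB budget (taken here as 5 × 1024 × 1024 bytes, ignoring JVM overhead) covers at most 1,619 distinct items.

```java
class TriangularCounts {
    private final int n;          // number of distinct items, renumbered 0..n-1
    private final int[] counts;   // one counter per unordered pair

    TriangularCounts(int n) {
        this.n = n;
        this.counts = new int[(int) pairSlots(n)]; // sketch: only valid for small n
    }

    /** Number of unordered pairs over n items. */
    static long pairSlots(long n) {
        return n * (n - 1) / 2;
    }

    /** Bytes needed for the counters alone, at 4 bytes per int. */
    static long memoryBytes(long n) {
        return 4L * pairSlots(n);
    }

    /** Position of pair {i, j} in the flat array; the order of i and j does not matter. */
    int index(int i, int j) {
        if (i == j || i < 0 || j < 0 || i >= n || j >= n) {
            throw new IllegalArgumentException("need two distinct items in [0, n)");
        }
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        // Rows 0..lo-1 occupy lo*(2n-lo-1)/2 slots; then step to column hi within row lo.
        return lo * (2 * n - lo - 1) / 2 + (hi - lo - 1);
    }

    void increment(int i, int j) {
        counts[index(i, j)]++;
    }

    int get(int i, int j) {
        return counts[index(i, j)];
    }

    public static void main(String[] args) {
        TriangularCounts t = new TriangularCounts(4);
        t.increment(2, 1);
        t.increment(1, 2);
        System.out.println(t.get(1, 2));          // 2
        System.out.println(memoryBytes(100_000)); // 19999800000
    }
}
```

The second printed value shows the scale of the problem: a hypothetical 100,000 distinct items would need roughly 20 GB of counters.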

  12. Algorithms Considered: • A-Priori Algorithm. • PCY Algorithm. • Multihash Algorithm.

  13. Why PCY • Improves on A-Priori. • Uses main memory efficiently. • Increases efficiency. • Alternatives considered: • A-Priori: requires scanning the data again and again. • FP-growth: uses more complex data structures and mining techniques, and requires extra memory. • Multihash: requires many disk reads, which decreases efficiency.

  14. Brief Description • Pass I: • During Pass I of A-Priori, most of main memory is idle. • PCY uses that memory to keep counts of the buckets into which pairs of items are hashed. • Between passes, the bucket counts are summarized as a bit vector, which gives an extra condition that candidate pairs must satisfy in Pass II. • Pass II: • Count only the pairs {i, j} that meet both conditions: • Both i and j are frequent items. • The pair {i, j} hashes to a bucket whose bit in the bit vector is 1. • Both conditions are necessary for the pair to have a chance of being frequent.
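
The two passes can be sketched in memory as follows. This is an illustration under assumptions, not the group's implementation: the class name `PcySketch`, the simple `31 * lo + hi` bucket hash, and the `"lo,hi"` string keys are hypothetical choices made for readability, baskets are assumed to contain no repeated items, and the support threshold is an absolute count. A real run would stream the baskets from disk twice instead of holding them in a list.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PcySketch {
    /** Returns every pair whose count is at least support, keyed as "lo,hi". */
    static Map<String, Integer> frequentPairs(List<int[]> baskets, int support, int numBuckets) {
        // Pass I: count single items and, in the otherwise idle memory, hashed pair buckets.
        Map<Integer, Integer> itemCounts = new HashMap<>();
        int[] bucketCounts = new int[numBuckets];
        for (int[] basket : baskets) {
            for (int item : basket) itemCounts.merge(item, 1, Integer::sum);
            for (int a = 0; a < basket.length; a++) {
                for (int b = a + 1; b < basket.length; b++) {
                    bucketCounts[bucket(basket[a], basket[b], numBuckets)]++;
                }
            }
        }

        // Between passes: one bit per bucket replaces the full counts.
        BitSet frequentBuckets = new BitSet(numBuckets);
        for (int k = 0; k < numBuckets; k++) {
            if (bucketCounts[k] >= support) frequentBuckets.set(k);
        }

        // Pass II: count only candidate pairs that satisfy both conditions.
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (int[] basket : baskets) {
            for (int a = 0; a < basket.length; a++) {
                for (int b = a + 1; b < basket.length; b++) {
                    int lo = Math.min(basket[a], basket[b]);
                    int hi = Math.max(basket[a], basket[b]);
                    boolean bothFrequent = itemCounts.get(lo) >= support
                            && itemCounts.get(hi) >= support;
                    if (bothFrequent && frequentBuckets.get(bucket(lo, hi, numBuckets))) {
                        pairCounts.merge(lo + "," + hi, 1, Integer::sum);
                    }
                }
            }
        }
        // Candidates are not guaranteed frequent (buckets collide), so filter by true count.
        pairCounts.values().removeIf(count -> count < support);
        return pairCounts;
    }

    /** Order-independent bucket number for a pair (hypothetical hash function). */
    static int bucket(int i, int j, int numBuckets) {
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        return Math.floorMod(31 * lo + hi, numBuckets);
    }

    public static void main(String[] args) {
        List<int[]> baskets = Arrays.asList(
                new int[] {1, 2, 3}, new int[] {1, 2}, new int[] {1, 2, 4}, new int[] {3, 4});
        System.out.println(frequentPairs(baskets, 2, 8)); // {1,2=3}
    }
}
```

The final filter matters: passing both conditions only makes a pair a candidate, so its exact count still has to reach the threshold.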

  15. Pros • Easy to implement. • Preferred whenever there is a main memory constraint.

  16. Result • With the support threshold at 2%, our algorithm took around 220 seconds and produced frequent singles, pairs and triples. • Built-in Java hash functions (e.g. MD5, MD4, SHA-256) do not work within our memory restriction.

  17. REFERENCES • Mining of Massive Datasets By Anand Rajaraman and Jeff Ullman. • http://www.cs.ucy.ac.cy/~dzeina/courses/epl446/lectures/09.pdf • http://faculty.simpson.edu/lydia.sinapova/www/cmsc250/LN250_Weiss/L17-ExternalSortEX1.htm

  18. Thank you!
