Big Data Analysis Technology University of Paderborn L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English) Summer semester 2013 June 12, 2013 Tobias Hardes (6687549) – Tobias.Hardes@gmail.com
Table of Contents • Introduction • Definitions • Background • Example • Related Work • Research • Main Approaches • Association Rule Mining • MapReduce Framework • Conclusion
Big Data vs. Business Intelligence • Big Data questions: How can we predict cancer early enough to treat it successfully? How can I make a significant profit on the stock market next month? • Business Intelligence questions: Which is the most profitable branch of our supermarket? In a specific country? During a specific period of time? (docs.oracle.com)
Background
Big Science – The LHC • 600 million times per second, particles collide within the Large Hadron Collider (LHC) • Each collision generates new particles • Particles decay in complex ways • Each collision is detected • The CERN Data Center reconstructs these collision events • 15 petabytes of data are stored every year • The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data (home.web.cern.ch)
Data Stream Analysis • Just-in-time analysis of data • Sensor networks • Analysis over a certain time window (e.g., the last 30 seconds)
Complex Event Processing (CEP) • Provides queries over streams • Uses an "Event Processing Language" (EPL) • select avg(price) from StockTickEvent.win:time(30 sec) • Tumbling Window (Slide = WindowSize) vs. Sliding Window (Slide < WindowSize) (https://forge.fi-ware.eu)
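To make the EPL query above concrete, the following is a minimal Python sketch of a sliding time-window average; the class, method, and event fields are illustrative assumptions and not part of Esper or any particular CEP engine.

```python
# Minimal sketch of what the EPL query computes: the average of stock prices
# over a 30-second sliding time window. Event fields (timestamp, price) are
# illustrative; a real CEP engine handles eviction and scheduling itself.
from collections import deque

class SlidingTimeWindowAverage:
    def __init__(self, window_seconds=30):
        self.window = window_seconds
        self.events = deque()  # (timestamp, price) pairs currently in the window

    def on_event(self, timestamp, price):
        self.events.append((timestamp, price))
        # Evict events that have fallen out of the time window.
        while self.events and self.events[0][0] < timestamp - self.window:
            self.events.popleft()
        return sum(p for _, p in self.events) / len(self.events)

# Usage: feed ticks as they arrive and read the current windowed average.
avg = SlidingTimeWindowAverage(30)
print(avg.on_event(0, 10.0))   # 10.0
print(avg.on_event(10, 20.0))  # 15.0
print(avg.on_event(40, 30.0))  # 25.0 -- the tick at t=0 has left the window
```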
Complex Event Processing – Areas of application • Trade-off: just-in-time analysis vs. the complexity of the algorithms • CEP is used with Twitter: • Identify emotional states of users • Sarcasm?
Principles • Statistics • Probability theory • Machine learning • Data Mining • Association rule learning • Cluster analysis • Classification
Association Rule Mining • Is soda purchased with bananas? • Relationships between items • Find associations, correlations or causal structures • Apriori algorithm • Frequent Pattern (FP)-Growth algorithm
Cluster Analysis • Grouping of similar objects into classes • Classes are defined during the clustering • k-Means • k-Means++ (a minimal k-Means sketch follows below)
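As a brief illustration of the clustering step, here is a hedged Python sketch of the basic k-Means loop; the 2-D points and k = 2 are made up for the example, and k-Means++ would differ only in how the initial centroids are chosen (spread out proportionally to squared distance).

```python
# Minimal k-Means sketch: assign points to the nearest centroid, then move
# each centroid to the mean of its cluster, until the centroids stop changing.
import random

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)  # naive initialization (k-Means++ improves this)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                         for i, cl in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(points, k=2))
```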
Research and future work • Performance, performance, performance… • Passes over the data source • Parallelization • NP-hard problems • … • Accuracy • Optimized solutions
Example • Apriori algorithm: n+1 database scans • FP-Growth algorithm: 2 database scans
Distributed computing – Motivation • Complex computational tasks • Several terabytes of data • Limited hardware resources • Google's MapReduce framework (Prof. Dr. Erich Ehses, FH Köln)
Main approaches
Structure • Association rule mining • Apriori algorithm • FP-Growth algorithm • Google's MapReduce
Association rule mining • Identify items that are related to other items • Example: analysis of shopping baskets in an online shop or in a supermarket
Terminology • A stream or a database with n elements: S • Item set: a set of items A • Frequency of occurrence of an item set: Φ(A) • Association rule: A ⇒ B • Support: supp(A ⇒ B) = Φ(A ∪ B) / n • Confidence: conf(A ⇒ B) = Φ(A ∪ B) / Φ(A)
Example • Rule: "If a basket contains cheese and chocolate, then it also contains bread" • 6 of 60 transactions contain cheese and chocolate • 3 of these 6 transactions also contain bread • So the support of the rule is 3/60 = 0.05 and its confidence is 3/6 = 0.5
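The same computation can be sketched in a few lines of Python; the toy baskets below are made up and serve only to exercise the support and confidence formulas defined above.

```python
# Support and confidence of an association rule A => B over a set of baskets:
# support = Φ(A ∪ B) / n, confidence = Φ(A ∪ B) / Φ(A).
def support_count(transactions, itemset):
    """Φ(itemset): number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    both = support_count(transactions, antecedent | consequent)
    return both / n, both / support_count(transactions, antecedent)

baskets = [
    {"cheese", "chocolate", "bread"},
    {"cheese", "chocolate"},
    {"bread", "milk"},
    {"cheese", "bread"},
]
print(rule_metrics(baskets, {"cheese", "chocolate"}, {"bread"}))
# -> (0.25, 0.5): 1 of 4 baskets holds all three items,
#    1 of the 2 cheese-and-chocolate baskets also holds bread
```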
Common approach • Split the problem into two tasks: • Generation of frequent item sets • Find item sets that satisfy a minimum support value • Generation of rules • Find high-confidence rules using these item sets
Apriori algorithm – Frequent item sets • Input: • Minimum support: min_sup • Data source: S
Apriori – Frequent item sets (I) • Generation of frequent item sets with min_sup = 2 • [Diagram: item set lattice over the items A, B, C, D with their support counts]
Apriori – Frequent item sets (II) • Generation of frequent item sets with min_sup = 2 • L1: A: 2, B: 4, C: 4, D: 3 • Candidates L2: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2 • Candidates L3: ACD: 2, BCD: 1
Apriori Algorithm – Rule generation • Uses frequent item sets to extract high-confidence rules • Based on the same principle as the item set generation • Done for every frequent item set Lk (a sketch follows below)
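As a rough illustration of this second phase, here is a hedged Python sketch that derives high-confidence rules from a table of support counts; the counts in the example are illustrative and would normally come from the item set generation phase.

```python
# Rule generation: for every frequent item set, try each non-empty proper
# subset as the antecedent and keep rules whose confidence reaches min_conf.
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) triples for high-confidence rules."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                conf = count / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

support_counts = {
    frozenset({"A"}): 2, frozenset({"C"}): 4, frozenset({"D"}): 3,
    frozenset({"A", "C"}): 2, frozenset({"C", "D"}): 2,
}
print(generate_rules(support_counts, min_conf=0.8))
# -> [({'A'}, {'C'}, 1.0)]: A => C holds with confidence 2/2 = 1.0
```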
Summary: Apriori algorithm • n+1 scans of the database • Expensive generation of the candidate item sets • Implements level-wise search using the frequent item set property • Easy to implement • Some opportunities for specialized optimizations
FP-Growth algorithm • Used for databases • Features: • Requires 2 scans of the database • Uses a special data structure – the FP-Tree • Step 1: Build the FP-Tree (compression of the database) • Step 2: Extract frequent item sets (divide this compressed database and apply data mining)
Extract frequent item sets (I) • Bottom-up strategy • Start with node "e" • Then look for "de" • Each path is processed recursively • Solutions are merged
Extract frequent item sets (II) • Is e frequent? Φ(e) = 3 – assuming the minimum support was set to 2, yes • Subproblems: Is de frequent? Is ce frequent? Is be frequent? Is ae frequent? • Use these subproblems to identify frequent item sets
Extract frequent item sets (III) • Update the support counts along the prefix paths • Remove node e • Check the frequency of the paths • Find item sets containing de, ce, ae or be (a sketch of the tree construction follows below)
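To ground the first phase of FP-Growth, here is a hedged Python sketch of the two-pass FP-Tree construction; the toy transactions are illustrative, and the recursive extraction of frequent item sets via conditional FP-Trees (the steps on the slides above) is omitted for brevity.

```python
# FP-Tree construction in two database passes: pass 1 counts single items,
# pass 2 inserts each transaction with its frequent items sorted by descending
# frequency, so common prefixes share nodes (this is the compression step).
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count single items and keep only the frequent ones.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup}

    # Pass 2: insert each transaction along a shared-prefix path.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
root, frequent = build_fp_tree(transactions, min_sup=2)
print(frequent)  # single-item support counts that survive min_sup
```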
Apriori vs. FP-Growth • FP-Growth has some advantages: • Two scans of the database • No expensive computation of candidates • Compressed data structure • Easier to parallelize (W. Zhang, H. Liao, and N. Zhao, "Research on the FP Growth Algorithm about Association Rule Mining")
MapReduce • Map and Reduce functions are expressed by the developer • map(key, val): emits new key-value pairs • reduce(key, values): emits an arbitrary output, usually a key with one value
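The classic word-count example makes this contract concrete; the following is a minimal Python sketch that simulates the shuffle phase in memory, whereas a real job would run on a framework such as Apache Hadoop rather than this toy driver.

```python
# Word count with the map/reduce contract described above. The in-memory
# "shuffle" groups intermediate values by key before the reduce phase runs.
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused here), value: document text
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word
    yield key, sum(values)

def run(documents):
    groups = defaultdict(list)
    for key, value in documents.items():
        for k, v in map_fn(key, value):
            groups[k].append(v)            # shuffle: group by intermediate key
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

print(run({"doc1": "big data big analysis", "doc2": "big data"}))
# -> {'big': 3, 'data': 2, 'analysis': 1}
```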
MapReduce execution overview • (1) The user program forks the master and the worker processes • (2) The master assigns map and reduce tasks to workers • (3) Map workers read splits of the input files • (4) Map workers write intermediate results to local disk • (5) Reduce workers read the intermediate files via RPC (one worker per key range, e.g. the "blue", "red" and "yellow" keys) • (6) Reduce workers write the output files • (7) Control returns to the user program • Phases: input files → map phase → intermediate files (shuffle) → reduce phase → output files
Conclusion: MapReduce (I) • MapReduce is designed as a batch processing framework • Not meant for ad-hoc analysis • Used for very large data sets • Used for time-intensive computations • Open source implementation: Apache Hadoop (http://hadoop.apache.org/)
Conclusion (I) • Big Data is important for research and in daily business • Different approaches • Data Stream analysis • Complex event processing • Rule Mining • Apriori algorithm • FP-Growth algorithm
Conclusion (II) • Clustering • K-Means • K-Means++ • Distributed computing • MapReduce • Performance / Runtime • Multiple minutes • Hours • Days… • Online analytical processing for Big Data?
Big Data definitions • "Every day, we create 2.5 quintillion bytes of data. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM) • "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.) • "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (McKinsey & Company)
Complex Event Processing – Windows • Sliding window: slides in time, buffers the last x elements (Slide < WindowSize) • Tumbling window: moves by as much as the window size (Slide = WindowSize)
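To complement the sliding-window sketch shown earlier, here is a hedged count-based tumbling window in Python; the values and window size are illustrative, and real CEP engines typically offer both time-based and count-based variants.

```python
# Tumbling window: non-overlapping batches (Slide = WindowSize), so every
# element contributes to exactly one aggregate.
def tumbling_averages(values, size):
    """Average each consecutive, non-overlapping batch of `size` values."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]

print(tumbling_averages([10, 20, 30, 40, 50, 60], size=3))  # [20.0, 50.0]
```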
Apriori Algorithm (Pseudocode) • L1 = {frequent 1-item sets} • for (k = 2; Lk-1 ≠ ∅; k++) do • Ck = candidates generated from Lk-1 • for each transaction t ∈ S do • for each candidate c ∈ Ck with c ⊆ t do • Φ(c) = Φ(c) + 1 • end for • end for • for each candidate c ∈ Ck do • if Φ(c) ≥ min_sup then add c to Lk • end if • end for • end for • return ⋃k Lk
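The pseudocode above translates fairly directly into Python; the following is a hedged sketch of the level-wise search (the toy baskets echo the A/B/C/D example from the earlier slides but are otherwise made up), with rule generation left to the sketch shown earlier.

```python
# Level-wise Apriori: generate candidate k-item sets from the frequent
# (k-1)-item sets, count them in one pass over the transactions, keep those
# meeting min_sup, and repeat until no new frequent item sets appear.
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    L = {frozenset({i}): n for i in items
         if (n := sum(i in t for t in transactions)) >= min_sup}   # L1
    frequent, k = dict(L), 2
    while L:
        # Candidate generation: join frequent (k-1)-item sets, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # One database scan per level: count how often each candidate occurs.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

baskets = [{"A", "C", "D"}, {"B", "C"}, {"A", "B", "C", "D"}, {"B", "D"}]
print(apriori(baskets, min_sup=2))
# frequent item sets with their counts, e.g. {A,C,D}: 2, {B,C}: 2, {C}: 3, ...
```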