Data Mining and Scalability Lauren Massa-Lochridge Nikolay Kojuharov Hoa Nguyen Quoc Le
Outline • Data Mining Overview • Scalability Challenges & Approaches • Overview – Association Rules • Case Study – BIRCH: An Efficient Data Clustering Method for VLDB • Case Study – Scientific Data Mining • Q&A
Data Mining: Rationale • Data size • Data in databases is estimated to double every year. • The number of people who look at the data stays constant. • Complexity • The analysis is complex. • The characteristics and relationships are often unexpected and unintuitive. • Knowledge discovery tools and algorithms are needed to make sense of, and use, the data.
Data Mining: Rationale (cont’d) • As of 2003, France Telecom had the largest decision-support DB, ~30 TB; AT&T was 2nd with a 26 TB database. • Some of the largest databases on the Web, as of 2003, include: • Alexa (www.alexa.com) internet archive: 7 years of data, 500 TB • Internet Archive (www.archive.org): ~300 TB • Google: over 4 billion pages, many, many TB • Applications • Business – analyze inventory, predict customer acceptance, etc. • Science – find correlations between genes and diseases, pollution and global warming, etc. • Government – uncover terrorist networks, predict flu pandemics, etc. Adapted from: Data Mining, and Knowledge Discovery: An Introduction, http://www.kdnuggets.com/dmcourse/other_lectures/intro-to-data-mining-notes.htm
Data Mining: Definition • Semi-automatic discovery of patterns, changes, anomalies, rules, and statistically significant structures and events in data. • Nontrivial extraction of implicit, previously unknown, and potentially useful information from data • Data mining is often done on targeted, preprocessed, transformed data. • Targeted: data fusion, sampling. • Preprocessed: Noise removal, feature selection, normalization. • Transformed: Dimension reduction.
Data Mining: Evolution Adapted from: An Introduction to Data Mining, http://www.thearling.com/text/dmwhite/dmwhite.htm
Data Mining: Approaches • Clustering - identify natural groupings within the data. • Classification - learn a function to map a data item into one of several predefined classes. • Summarization – describe groups, summary statistics, etc. • Association – identify data items that occur frequently together. • Prediction – predict values or distribution of missing data. • Time-series analysis – analyze data to find periodicity, trends, deviations.
Scalability & Performance • Scaling and performance are often considered together in data mining: the problem of scalability in DM is not only how to process such large data sets, but how to do so within a useful timeframe. • Many of the scalability issues in DM and DBMS are similar to performance-scaling issues in data management in general. • Dr. Gregory Piatetsky-Shapiro and Prof. Gary Parker (P&P) identify the main issue for clustering algorithms as an approach to DM: “The main issue in clustering is how to evaluate the quality of potential grouping. There are many methods, ranging from manual, visual inspection to a variety of mathematical measures that minimize the similarity of items within the cluster and maximize the difference between the clusters."
Common DM Scaling Problem Algorithms generally: • operate on data under the assumption that the entire data set is processed in memory • operate under the assumption that KIWI ("kill it with iron", i.e., more hardware) will be used to address I/O and other performance-scaling issues • or simply don't address scalability within resource constraints at all
Data Mining: Scalability • Large Datasets • Use scalable I/O architecture - minimize I/O, make it fit, make it fast. • Reduce data - aggregation, dimensional reduction, compression, discretization. • Complex Algorithms • Reduce algorithm complexity • Exploit parallelism, use specialized hardware • Complex Results • Effective visualization • Increase understanding, trustworthiness
Scalable I/O Architecture • Shared-memory parallel computers: local + global memory; locking is used to synchronize. • Distributed-memory parallel computers: message passing / remote memory operations. • Parallel disk: records are grouped into blocks of B records each (the unit of transfer), and D blocks can be read or written at once. • Primitives: scatter, gather, reduction. • Data parallelism or task parallelism.
Scaling – General Approaches • Question: how can we tackle memory constraints and efficiency? • Statistics: manipulate the data so it fits into memory – sampling, feature selection, partitioning, summarization. • Database: reduce the time to access out-of-memory data – specialized data structures, block reads, parallel block reads. • High-performance computing: use several processors. • Data mining implementation: efficient DM primitives, pre-computation. • Misc.: reduce the amount of data – discretization, compression, transformation (a small sketch of two of these ideas follows below).
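To make the data-reduction bullets concrete, here is a minimal Python sketch of equal-width discretization and random sampling; the function names, bin count, and sample fraction are illustrative choices of ours, not from any particular library:

```python
import random

def equal_width_discretize(values, n_bins):
    """Map each value to the index of an equal-width bin (data reduction)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # guard against constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def sample(values, fraction, seed=0):
    """Keep a random fraction of the data so the working set fits in memory."""
    rng = random.Random(seed)
    return [v for v in values if rng.random() < fraction]

data = [random.gauss(50, 10) for _ in range(10_000)]
print(equal_width_discretize(data, n_bins=8)[:10])  # coarse bin labels
print(len(sample(data, fraction=0.01)))             # ~100 surviving points
```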
Association Rules • The beer-and-diapers example. • Basket data analysis, cross-marketing, sales campaigns, Web-log analysis, etc. • First introduced in 1993: Mining Association Rules between Sets of Items in Large Databases, R. Agrawal et al., SIGMOD '93.
AR Applications X ==> Y: • What are the interesting applications? • Find all rules with "bagels" as X: what should be shelved together with bagels? What would be impacted if we stop selling bagels? • Find all rules with "Diet Coke" as Y: what should the store do to promote Diet Coke? • Find all rules relating any items in Aisles 1 and 2: shelf planning to see if the two aisles are related.
AR: Input & Output • Input: • a database of sales "transactions" • parameters: • minimal support: say 50% • minimal confidence: say 100% • Output: • rule 1: {2, 3} ==> {5}: s = ?, c = ? • rule 2: {3, 5} ==> {2} • more … • Example transactions: TID 100 → {1, 3, 4}; TID 200 → {2, 3, 5}; TID 300 → {1, 2, 3, 5}; TID 400 → {2, 5} (a sketch for computing s and c follows below).
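A small sketch of how the support s and confidence c of the example rules can be computed from the transaction table above; the helper names are ours. Support is the fraction of transactions containing X ∪ Y, and confidence is support(X ∪ Y) / support(X):

```python
# Example transactions from the slide, keyed by TID.
transactions = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(x, y):
    """Confidence of the rule X ==> Y."""
    return support(x | y) / support(x)

print(support({2, 3, 5}), confidence({2, 3}, {5}))  # rule 1: {2,3} ==> {5}
print(support({2, 3, 5}), confidence({3, 5}, {2}))  # rule 2: {3,5} ==> {2}
```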
Apriori: A Candidate Generation-and-Test Approach • Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested! (Agrawal & Srikant @ VLDB '94; Mannila et al. @ KDD '94) • Method: • initially, scan the DB once to get the frequent 1-itemsets • generate length-(k+1) candidate itemsets from the length-k frequent itemsets • test the candidates against the DB • terminate when no frequent or candidate set can be generated (a compact sketch follows below)
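A compact, in-memory Python sketch of the generate-and-test loop just described; it is illustrative only, not the optimized algorithm from the paper (which works against a DBMS and does one data scan per level):

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    # One DB scan for the frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    all_freq, k = set(freq), 1
    while freq:
        k += 1
        # Join step, then Apriori pruning: every (k-1)-subset of a
        # candidate must itself be frequent.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # Test step: count the surviving candidates against the DB.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        all_freq |= freq
    return all_freq

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(db, min_support=0.5))))
```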
Scaling Attempts • Count distribution: distribute transaction data among processors and count the transactions in parallel – scales linearly with the number of transactions. • Savasere et al. (VLDB '95): partition the data and scan twice (local, then global). • Toivonen (VLDB '96): sampling, with verification of closure borders. • Brin et al. (SIGMOD '97): dynamic itemset counting. • Pei & Han (SIGMOD '00): compact description (FP-tree), no candidate generation; scales up with partition-based projection.
BIRCH Approach • Informal definition: "data clustering identifies the sparse and the crowded places and hence discovers the overall distribution patterns of the data set." • BIRCH belongs to the category of hierarchical clustering algorithms that utilize a distance measure; K-Means is an example of a distance-based method. • Approach: "statistical identification of clusters, i.e., densely populated regions in a multi-dimensional dataset, given the desired number of clusters K, a dataset of N points, and a distance-based measurement". • Problem: other approaches – distance-measure, hierarchical, etc. – are all similar in terms of scaling and resource utilization.
BIRCH Novelty • First algorithm proposed in the database area that filters out "noise", i.e., outliers. • Prior work does not adequately address large data sets while minimizing I/O cost. • Prior work does not address whether the data set fits in memory. • Prior work does not address resource utilization or resource constraints in scalability and performance.
Database / DM Oriented Constraints • Resource utilization means maximizing usage of available resources, as opposed to just working within resource constraints, which does not necessarily optimize utilization. • Resource utilization is important in DM scaling, and in any case where the data sets are very large. • A single BIRCH scan of the data set yields at minimum a "good enough" clustering. • One or more additional passes are optional and, depending on the constraints of a particular system and application, can be used to improve quality beyond "good enough".
Database Oriented Constraints • Database-oriented constraints are what differentiate BIRCH from more general DM algorithms: • limited acceptable response time • resource utilization – optimize, not just work within, the resources available – necessary for very large data sets • fit to available memory • minimize I/O costs • need I/O cost linear in the size of the data set
Features of BIRCH Solution: • Locality of reference: each unit clustering decision is made without scanning all data points or all existing clusters. • Clustering decisions: measurements reflect the natural "closeness" of points. • Locality enables clusters to be incrementally maintained and updated during the clustering process. • Optional removal of outliers: • a cluster is a dense region of points; • an outlier is a point in a sparse region.
More Features of BIRCH Solution: • Optimal memory resource usage -> utilization within resource constraints. • Finest possible subclusters, given memory, I/O, and time constraints: • the finest clusters given memory imply the best accuracy achievable (another type of optimal utilization). • Minimize I/O costs: • implies efficiency and the required response time.
More Features of BIRCH Solution • Running time is linearly scalable in the size of the data set. • Optionally, incremental scan of the data set, i.e., the entire data set does not have to be available in advance, and increments are adjustable. • Scans the complete data set only once (others scan multiple times).
Background (Single Cluster) • Given N d-dimensional data points $\{\vec{X}_i\}$: • Centroid: $\vec{X}_0 = \frac{1}{N}\sum_{i=1}^{N}\vec{X}_i$ • Radius (average distance from member points to the centroid): $R = \left(\frac{1}{N}\sum_{i=1}^{N}(\vec{X}_i-\vec{X}_0)^2\right)^{1/2}$ • Diameter (average pairwise distance within the cluster): $D = \left(\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N}(\vec{X}_i-\vec{X}_j)^2\right)^{1/2}$
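For reference, the same single-cluster statistics written out in NumPy, with X assumed to be an (N, d) array of the N d-dimensional points:

```python
import numpy as np

def centroid(X):
    """X0: component-wise mean of the points."""
    return X.mean(axis=0)

def radius(X):
    """R: average distance from member points to the centroid."""
    return np.sqrt(((X - centroid(X)) ** 2).sum(axis=1).mean())

def diameter(X):
    """D: average pairwise distance between member points."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.sum() / (n * (n - 1)))        # i = j terms are zero

X = np.random.default_rng(0).normal(size=(100, 2))
print(centroid(X), radius(X), diameter(X))
```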
Background (Two Clusters) Given the centroids $\vec{X}_0$ and $\vec{Y}_0$: • Centroid Euclidean distance: $D0 = \left((\vec{X}_0-\vec{Y}_0)^2\right)^{1/2}$ • Centroid Manhattan distance: $D1 = \sum_{k=1}^{d}\left|X_0^{(k)}-Y_0^{(k)}\right|$
Background (Two Clusters) • Average inter-cluster distance: $D2 = \left(\frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(\vec{X}_i-\vec{Y}_j)^2\right)^{1/2}$ • Average intra-cluster distance of the merged cluster, with $\{\vec{Z}_i\}$ the union of the two clusters: $D3 = \left(\frac{1}{(N_1+N_2)(N_1+N_2-1)}\sum_{i=1}^{N_1+N_2}\sum_{j=1}^{N_1+N_2}(\vec{Z}_i-\vec{Z}_j)^2\right)^{1/2}$
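The four inter-cluster distances, again as a NumPy sketch, assuming the two clusters are given as (N1, d) and (N2, d) arrays X and Y:

```python
import numpy as np

def d0(X, Y):
    """Centroid Euclidean distance."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def d1(X, Y):
    """Centroid Manhattan distance."""
    return np.abs(X.mean(axis=0) - Y.mean(axis=0)).sum()

def d2(X, Y):
    """Average inter-cluster distance (over all cross pairs)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.mean())

def d3(X, Y):
    """Average intra-cluster distance of the merged cluster."""
    Z = np.vstack([X, Y])
    n = Z.shape[0]
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.sum() / (n * (n - 1)))

rng = np.random.default_rng(1)
X, Y = rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (60, 2))
print(d0(X, Y), d1(X, Y), d2(X, Y), d3(X, Y))
```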
Clustering Feature • CF = (N, LS, SS) • N = |C|, the number of data points • LS = $\sum_{i=1}^{N}\vec{X}_i$, the linear sum of the N data points • SS = $\sum_{i=1}^{N}\vec{X}_i^{\,2}$, the square sum of the N data points • A CF is a compact summarization of a cluster.
CF Additivity Theorem • Assume CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) summarize two disjoint clusters; then the merged cluster is summarized by CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2). • The information stored in CFs is sufficient to compute: • centroids • measures of the compactness of clusters • distance measures between clusters
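A sketch of the CF representation and the additivity theorem; note how the centroid and radius of a merged cluster come from the summed CF alone, with no need to revisit the raw points. Here SS is kept as the scalar sum of squared norms, a common simplification:

```python
import numpy as np

def cf(X):
    """CF = (N, LS, SS) for an (N, d) array; SS = sum of squared norms."""
    return X.shape[0], X.sum(axis=0), float((X ** 2).sum())

def merge(cf1, cf2):
    """Additivity: the merged cluster's CF is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

def centroid_and_radius(c):
    """Centroid and radius from the CF alone: R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = c
    x0 = ls / n
    return x0, np.sqrt(max(ss / n - (x0 ** 2).sum(), 0.0))

rng = np.random.default_rng(2)
X, Y = rng.normal(size=(40, 2)), rng.normal(size=(60, 2))
merged = merge(cf(X), cf(Y))
assert np.allclose(merged[1], cf(np.vstack([X, Y]))[1])  # LS adds up
print(centroid_and_radius(merged))
```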
CF-Tree • A height-balanced tree with two parameters: • branching factor: • B – an internal node contains at most B entries [CFi, childi] • L – a leaf node contains at most L entries [CFi] • threshold T: • the diameter of all entries in a leaf node is at most T • Leaf nodes are connected via prev and next pointers – efficient for data scans.
BIRCH Algorithm Scaling Details CF / CF-tree used to optimize clusters for memory & I/O: • P: page size (a page of memory). • Tree size is a function of T: a larger T gives a smaller CF-tree. • Each node is required to fit in a memory page of size P – nodes are split to fit, or merged for optimal utilization, dynamically. • P can be varied on the system or in the algorithm for performance tuning and scaling.
Phase 1 (CF-tree build): 1. Start a CF-tree t1 with the initial threshold T. 2. Continue scanning the data and inserting points into t1. 3. If memory runs out before the scan finishes: increase T; rebuild a smaller CF-tree t2 with the new T out of t1, writing a leaf entry to disk if it is a potential outlier and re-inserting it otherwise; then continue scanning with t2 as the new t1. 4. If disk space runs out: re-absorb the potential outliers into t1. 5. When the scan of the data finishes: re-absorb the remaining potential outliers into t1.
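Below is a deliberately simplified, single-level sketch of this Phase 1 control flow. Real BIRCH maintains a height-balanced CF-tree and spills potential outliers to disk; in this hypothetical flat version the "tree" is just a list of CFs, max_cfs stands in for the memory limit, and the rebuild merges CFs by centroid distance – a simplification of the paper's re-insertion procedure:

```python
import numpy as np

def radius_from_cf(n, ls, ss):
    """Radius computed from a CF alone: R^2 = SS/N - ||LS/N||^2."""
    x0 = ls / n
    return np.sqrt(max(ss / n - (x0 ** 2).sum(), 0.0))

def rebuild(cfs, t):
    """Re-insert existing CFs under a larger T, merging nearby ones."""
    out = []
    for n, ls, ss in cfs:
        x0 = ls / n
        best = min(out, key=lambda c: np.linalg.norm(c[1] / c[0] - x0),
                   default=None)
        if best is not None and np.linalg.norm(best[1] / best[0] - x0) <= t:
            best[0] += n; best[1] = best[1] + ls; best[2] = best[2] + ss
        else:
            out.append([n, ls.copy(), ss])
    return out

def phase1(points, max_cfs, t=0.0):
    cfs = []
    for x in points:
        sx = (x ** 2).sum()
        best = min(cfs, key=lambda c: np.linalg.norm(c[1] / c[0] - x),
                   default=None)
        # Absorb x into the closest CF if the threshold T still holds.
        if best is not None and radius_from_cf(best[0] + 1, best[1] + x,
                                               best[2] + sx) <= t:
            best[0] += 1; best[1] = best[1] + x; best[2] = best[2] + sx
        else:
            cfs.append([1, x.copy(), sx])
        while len(cfs) > max_cfs:      # "out of memory": raise T, rebuild
            t = t * 2 or 0.5
            cfs = rebuild(cfs, t)
    return cfs, t

pts = np.random.default_rng(3).normal(size=(1000, 2))
cfs, t = phase1(pts, max_cfs=50)
print(len(cfs), t)
```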
Analysis • I/O cost of Phase 1 is linear in the size of the data set: the data are scanned exactly once, plus a bounded amount of disk traffic for writing and re-reading potential outliers. • Under the threshold-growing heuristic, the number of CF-tree rebuilds is at most about $\log_2(N/N_0)$. • Where: • N: number of data points • M: memory size • P: page size • d: dimension • N0: number of data points loaded into memory with threshold T0
Terminologies • Data mining: The semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures of data. • Pattern recognition: The discovery and characterization of patterns • Pattern: An ordering with an underlying structure • Feature: Extractable measurement or attribute
Scientific Data Mining [Figure 1: Key steps in scientific data mining]
Data mining is essential • Scientific data sets are very complex: • multi-sensor, multi-resolution, multi-spectral data • high-dimensional data • mesh data from simulations • Data contaminated with noise: • sensor noise, clouds, atmospheric turbulence, …
Data mining is essential • Massive datasets • Advances in technology allow us to collect ever-increasing amounts of scientific data (in experiments, observations, and simulations). • Astronomy datasets with tens of millions of galaxies. • Sloan Digital Sky Survey: assuming a pixel size of about 0.25″, the whole sky is ~10 terapixels; at 2 bytes/pixel, that is ~20 TB of image data. • Collection of data made possible by advances in: • sensors (telescopes, satellites, …) • computers and storage (faster, more parallel, …) • We need fast and accurate data analysis techniques to realize the full potential of our enhanced data-collecting ability; manual techniques are infeasible.
Data mining in astronomy • FIRST: detecting radio-emitting stars • Dataset: 100 GB of image data (1996) – 16K image maps, 7.1 MB each
Data mining in astronomy • Example • Result: find 20K radio-emitting stars among 400K entries
Mining climate data (Univ. of Minnesota) Research goal: • Find global climate patterns of interest to Earth scientists; a key interest is finding connections between the ocean/atmosphere and the land. • Data: global snapshots of values for a number of variables (e.g., average monthly temperature) on land surfaces or water. • The data span a range of 10 to 50 years.
Mining climate data (Univ. of Minnesota) • EOS satellites (Earth Observing System, e.g., the Terra and Aqua satellites) provide high-resolution measurements: • finer spatial grids – an 8 km × 8 km grid produces 10,848,672 data points; a 1 km × 1 km grid produces 694,315,008 data points • more frequent measurements • multiple instruments • Generates terabytes of data per day -> SCALABILITY