High Performance Data Mining

High Performance Data Mining. Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF. Overview. Introduction to Data Mining (What, Why, and How?)

Presentation Transcript


  1. High Performance Data Mining Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF

  2. Overview • Introduction to Data Mining (What, Why, and How?) • Issues and Challenges in Designing Parallel Data Mining Algorithms • Case Study: Discovery of Patterns in Global Climate Data using Data Mining • Summary

  3. What is Data Mining? • Many Definitions • Non-trivial extraction of implicit, previously unknown and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

  4. What is (not) Data Mining? • What is not Data Mining? • Look up phone number in phone directory • Query a Web search engine for information about “Amazon” • What is Data Mining? • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area) • Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

  5. Why Mine Data? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data, e-commerce • purchases at department/grocery stores • Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive Pressure is Strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

  6. Why Mine Data? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • remote sensors on a satellite • telescopes scanning the skies • microarrays generating gene expression data • scientific simulations generating terabytes of data • Traditional techniques infeasible for raw data • Data mining may help scientists • in classifying and segmenting data • in Hypothesis Formation

  7. Mining Large Data Sets - Motivation • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all [Figure: The Data Gap, total new disk (TB) since 1995 versus number of analysts. From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”]

  8. Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data [Figure: Data Mining drawing on Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

  9. Role of Parallel and Distributed Computing • Many algorithms have computation time greater than O(n) • High Performance Computing (HPC) is often critical for scalability to large data sets • Sequential computers have limited memory • This may require multiple, expensive I/O passes over data • Data may be distributed • due to privacy reasons • physically dispersed over many different geographic locations [Figure: Data Mining drawing on Statistics/AI, Machine Learning/Pattern Recognition, High Performance Computing, and Database systems]

  10. Data Mining Tasks... • Clustering • Predictive Modeling • Anomaly Detection • Association Rules

  11. Predictive Modeling • Find a model for class attribute as a function of the values of other attributes [Figure: Learn Classifier. Training data with two categorical attributes, one continuous attribute, and a class label yields a model for predicting tax evasion: a decision tree with tests on Married and on Income (thresholds 100K and 80K) and leaves NO and YES]

  12. Predictive Modeling: Applications • Targeted Marketing • Customer Attrition/Churn • Classifying Galaxies • Class: Stages of Formation (Early, Intermediate, Late) • Attributes: Image features, characteristics of light waves received, etc. • Sky Survey Data Size: • 72 million stars, 20 million galaxies • Object Catalog: 9 GB • Image Database: 150 GB Courtesy: http://aps.umn.edu

  13. Clustering • Given a set of data points, find groupings such that • Data points in one cluster are more similar to one another • Data points in separate clusters are less similar to one another
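
The grouping criterion above can be made concrete with a small sketch. It uses k-means, which a later slide names for the climate data; the sample points, the starting centers, and the function name `kmeans` are illustrative assumptions, not taken from the talk.

```python
from math import dist


def kmeans(points, centers, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(center)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters


# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.3)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
```

Points in the same returned cluster are closer to their own center than to the other one, which is one simple reading of "more similar to one another".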

  14. Clustering: Applications • Market Segmentation • Gene expression clustering • Document Clustering

  15. Association Rule Discovery • Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record • Applications • Marketing and Sales Promotion • Supermarket shelf management • Inventory Management Rules Discovered: {Milk} --> {Coke} (s=0.6, c=0.75) {Diaper, Milk} --> {Beer} (s=0.4, c=0.67)
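
As a sketch of how the support (s) and confidence (c) values quoted above are computed: the five-transaction table below is hypothetical, chosen so that the two rules reproduce the slide's numbers, since the slide's own table did not survive in this transcript. The helper names `support` and `confidence` are likewise illustrative.

```python
# Hypothetical market-basket table, chosen to reproduce the slide's numbers.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by support of lhs alone."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} --> {sorted(rhs)}  s={s:.2f}, c={c:.2f}")
```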

  16. Deviation/Anomaly Detection • Detect significant deviations from normal behavior • Applications: • Credit Card Fraud Detection • Network Intrusion Detection Typical network traffic at University level may reach over 100 million connections per day

  17. General Issues and Challenges in Parallel Data Mining • Dense vs. Sparse • Structured versus Unstructured • Static vs. Dynamic • Data mining computations tend to be unstructured, sparse and dynamic.

  18. Specific Issues and Challenges in Parallel Data Mining • Disk I/O • Data is often too large to fit in main memory • Spatial locality is critical • Hash Tables • Many efficient data mining algorithms require fast access to large hash tables.

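  19. Constructing a Decision Tree • Candidate split on Marital Status: Single/Divorced (Pay: 3, Evade: 3), Married (Pay: 4, Evade: 0) • Candidate split on Refund: Yes (Pay: 3, Evade: 0), No (Pay: 4, Evade: 3) • Key Computation: the class-count table for each attribute value (Refund: Pay 3, Evade 0; No Refund: Pay 4, Evade 3)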
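
The class counts on this slide are all a split evaluator needs. The sketch below scores the two candidate splits with the Gini index; the slide does not say which criterion is used, so Gini is an assumption here (it is the measure used by SLIQ and SPRINT, which the talk cites), and the function names are illustrative.

```python
def gini(counts):
    """Gini impurity of one node, given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * gini(counts) for counts in children)


# (Pay, Evade) counts per branch, as given on the slide.
splits = {
    "Marital Status": [(3, 3), (4, 0)],  # Single/Divorced, Married
    "Refund": [(3, 0), (4, 3)],          # Yes, No
}
scores = {name: weighted_gini(children) for name, children in splits.items()}
best = min(scores, key=scores.get)  # lower impurity is better
```

Under this assumed criterion the Marital Status split scores 0.30 against roughly 0.34 for Refund, so it would be preferred.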

  20. Constructing a Decision Tree Refund: Yes Refund: No

  21. Constructing a Decision Tree in Parallel • Partitioning of data only (n records, m categorical attributes) • global reduction per node is required • large number of classification tree nodes gives high communication cost [Figure: class-count table, Refund: Pay 3, Evade 0; No Refund: Pay 4, Evade 3]
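
A minimal single-process sketch of the data-partitioned scheme, assuming each processor holds a slice of the records, builds a local class-count table, and the tables are then summed in a global reduction. The records are hypothetical (constructed to match the count table on the slide), processors are simulated with list slices, and no real message-passing library is used.

```python
from collections import Counter

# Hypothetical (refund, class) records consistent with the slide's count table.
records = [("Yes", "Pay")] * 3 + [("No", "Pay")] * 4 + [("No", "Evade")] * 3


def local_counts(partition):
    """Class-count table one processor builds from its own records."""
    return Counter(partition)


def global_reduction(tables):
    """Element-wise sum of the per-processor tables (stands in for an all-reduce)."""
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds counts rather than replacing them
    return total


num_procs = 4  # simulated processors
partitions = [records[p::num_procs] for p in range(num_procs)]
counts = global_reduction(local_counts(part) for part in partitions)
```

One such reduction is needed for every tree node being expanded, which is why a large number of nodes translates into high communication cost.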

  22. Constructing a Decision Tree in Parallel • Partitioning of classification tree nodes • natural concurrency • load imbalance as the amount of work associated with each node varies • child nodes use the same data as used by parent node • loss of locality • high data movement cost [Figure: classification tree in which 10,000 training records split into 7,000 and 3,000 records, which split further into 2,000, 5,000, 2,000, and 1,000]

  23. Challenges in Constructing Parallel Classifier • Partitioning of data only • large number of classification tree nodes gives high communication cost • Partitioning of classification tree nodes • natural concurrency • load imbalance as the amount of work associated with each node varies • child nodes use the same data as used by parent node • loss of locality • high data movement cost • Hybrid algorithms: partition both data and tree

  24. Experimental Results (Srivastava, Han, Kumar, and Singh, 1999) • Data set • function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96) • 2 class labels, 3 categorical and 6 continuous attributes • IBM SP2 with 128 processors • 66.7 MHz CPU with 256 MB real memory • AIX version 4 • high performance switch

  25. Speedup Comparison of the Three Parallel Algorithms 0.8 million examples 1.6 million examples

  26. Splitting Criterion Verification in the Hybrid Algorithm 0.8 million examples on 8 processors 1.6 million examples on 16 processors

  27. Speedup of the Hybrid Algorithm with Different Size Data Sets

  28. Scaleup of the Hybrid Algorithm

  29. Hash Table Access • Some efficient decision tree algorithms require random access to large data structures. • Example: SPRINT (Shafer, Agrawal, Mehta) • Storing the entire hash table on one processor makes the algorithm unscalable.

  30. ScalParC (Joshi, Karypis, Kumar, 1998) • ScalParC is a scalable parallel decision tree construction algorithm • Scales to large number of processors • Scales to large training sets • ScalParC is memory efficient • The hash-table is distributed among the processors • ScalParC performs minimum amount of communication

  31. This Design is Inspired by.. • Communication Structure of Parallel Sparse Matrix-Vector Algorithms.

  32. Parallel Runtime (Joshi, Karypis, Kumar, 1998) 128 Processor Cray T3D

  33. Computing Association Patterns 1. Market-basket transactions 2. Find item combinations (itemsets) that occur frequently in data 3. Generate association rules

  34. Computing Associations Requires Exponential Computation {a} {b} {c} {d} {a,b} {a,c} {a,d} {b,c} {b,d} {c,d} {a,b,c} {a,b,d} {a,c,d} {b,c,d} {a,b,c,d} Given m items, there are 2^m - 1 possible item combinations
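
The count can be checked by brute-force enumeration. This sketch lists every non-empty combination of the four items shown above; the function name `all_itemsets` is illustrative.

```python
from itertools import combinations


def all_itemsets(items):
    """Every non-empty combination of items, smallest itemsets first."""
    items = sorted(items)
    return [
        set(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


itemsets = all_itemsets("abcd")
assert len(itemsets) == 2 ** 4 - 1  # 15 itemsets for m = 4 items
```

Each added item doubles the number of itemsets (plus one), so enumeration of this kind is only feasible for very small m.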

  35. Handling Exponential Complexity • Given n transactions and m different items: • number of possible association rules: • computation complexity: • Systematic search for all patterns, based on support constraint [Agrawal & Srikant]: • If {A,B} has support at least a, then both A and B have support at least a. • If either A or B has support less than a, then {A,B} has support less than a. • Use patterns of k-1 items to find patterns of k items.

  36. Illustrating Apriori Principle (Agrawal and Srikant, 1994) • Minimum Support = 3 • Items (1-itemset candidates) • Pairs (2-itemset candidates) • Triplets (3-itemset candidates) • If every subset is considered, 6C1 + 6C2 + 6C3 = 41 • With support-based pruning, 6 + 6 + 1 = 13
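
A rough sketch of level-wise Apriori search with support-based pruning. The transactions are hypothetical, chosen so that a minimum support of 3 yields the slide's candidate counts of 6, 6, and 1; the slide's own tables did not survive in this transcript. Function and variable names are illustrative, and the linear-scan counting is the "simple way" rather than an optimized one.

```python
from itertools import combinations

# Hypothetical transactions, chosen so the candidate counts match the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MIN_SUPPORT = 3  # absolute count, as on the slide


def count(itemset):
    """Number of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions)


def apriori(max_size=3):
    """Return (frequent itemsets, number of candidates counted per level)."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset([i]) for i in items]
    frequent, candidates_per_level = [], []
    for size in range(1, max_size + 1):
        candidates_per_level.append(len(candidates))
        level = {c for c in candidates if count(c) >= MIN_SUPPORT}
        frequent.extend(level)
        # Next level: unions of frequent itemsets that are one item larger
        # and whose every subset of the current size is itself frequent.
        joined = {a | b for a in level for b in level if len(a | b) == size + 1}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size))
        ]
    return frequent, candidates_per_level


frequent, candidates_per_level = apriori()
print(candidates_per_level)  # [6, 6, 1]
```

Only 13 candidates are counted instead of the 41 subsets of size up to three, because any itemset with an infrequent subset is never generated.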

  37. Counting Candidates • Frequent Itemsets are found by counting candidates. • Simple way: • Search for each candidate in each transaction. Expensive!!! [Figure: Transactions matched against Candidates, with dimensions labeled M and N]

  38. Parallel Formulation of Association Rules (Han, Karypis, and Kumar, 2000) • Need: • Huge Transaction Datasets (10s of TB) • Large Number of Candidates. • How? • Partition the Transaction Database among processors • communication needed for global counts • local memory on each processor should be large enough to store the entire hash tree • Partition the Candidates among processors • redundant I/O for transactions • Partition both Candidates and Transaction Database

  39. Parallel Association Rules: Scaleup Results (100K, 0.25%) (Han, Karypis, and Kumar, 2000)

  40. Parallel Association Rules: Response Time (np=64,50K) (Han, Karypis, and Kumar, 2000)

  41. Discovery of Patterns in the Global Climate System Research Goals: • Find global climate patterns of interest to Earth Scientists • Global snapshots of values for a number of variables on land surfaces or water. • Monthly over a range of 10 to 50 years. # grid points: 67K Land, 40K Ocean Current data size range: 20 – 400 MB

  42. Importance of Global Climate Patterns and NPP • Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants. • Keeping track of NPP is important because it includes the food source of humans and all other organisms. • NPP is impacted by global climate patterns. Image from http://www.pmel.noaa.gov/co2/gif/globcar.png

  43. Patterns of Interest • Zone Formation • Find regions of the land or ocean which have similar behavior. • Associations • Find relations between climate events and land cover. • Teleconnections • Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth. • El Niño associated with droughts in Australia and Southern Africa and heavy rainfall along the western coast of South America. [Figure: Sea Surface Temperature Anomalies off Peru (ANOM 1+2)]

  44. Clustering of Raw NPP and Raw SST (Num clusters = 2)

  45. K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2) Land Cluster Cohesion: North = 0.78 South = 0.59 Ocean Cluster Cohesion: North = 0.77 South = 0.80

  46. Ocean Climate Indices: Connecting the Ocean and the Land • An OCI is a time series of temperature or pressure • Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP) • OCIs are important because • They distill climate variability at a regional or global scale into a single time series. • They are related to well-known climate phenomena such as El Niño.

  47. Ocean Climate Indices – ANOM 1+2 • ANOM 1+2 is associated with El Niño and La Niña. • Defined as the Sea Surface Temperature (SST) anomalies in a region off the coast of Peru • El Niño is associated with • Droughts in Australia and Southern Africa • Heavy rainfall along the western coast of South America • Milder winters in the Midwest [Figure: ANOM 1+2 time series with El Niño events marked]

  48. Connection of ANOM 1+2 to Land Temp OCIs capture teleconnections, i.e., the simultaneous variation in climate and related processes over widely separated points on the Earth.

  49. Ocean Climate Indices - NAO • The North Atlantic Oscillation (NAO) is associated with climate variation in Europe and North America. • Normalized pressure differences between Ponta Delgada, Azores and Stykkisholmur, Iceland. • Associated with warm and wet winters in Europe and with cold and dry winters in northern Canada and Greenland • The eastern US experiences mild and wet winter conditions. [Figure: map marking Iceland and the Azores]
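
One plausible reading of "normalized pressure differences" is sketched below: each station's pressure series is standardized to zero mean and unit variance, and the Iceland series is subtracted from the Azores series. The slide does not give the exact normalization, so this is an assumption, and the pressure values are made up for illustration rather than observed.

```python
from statistics import mean, pstdev


def standardize(series):
    """Convert a series to anomalies in units of its own standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def pressure_difference_index(south, north):
    """Difference of the standardized pressure series, south minus north."""
    return [s - n for s, n in zip(standardize(south), standardize(north))]


# Made-up sea level pressures (hPa) for illustration only, not observations.
azores = [1022.0, 1025.0, 1019.0, 1024.0]
iceland = [1004.0, 998.0, 1008.0, 1002.0]
index = pressure_difference_index(azores, iceland)
```

The index is positive when pressure is unusually high over the Azores and unusually low over Iceland, and negative in the opposite case, giving a single time series of the kind the OCI slides describe.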

  50. Connection of NAO to Land Temp
