Intro to Data Mining Natasha Balac, Ph.D. Predictive Analytics Center of Excellence, Director San Diego Supercomputer Center University of California, San Diego
Necessity is the Mother of Invention • Problem • Data explosion • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories • “We are drowning in data, but starving for knowledge!” (John Naisbitt, 1982)
Necessity is the Mother of Invention • Solution • Predictive Analytics or Data Mining • Extraction or “mining” of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases • Data-driven discovery and modeling of hidden patterns in large volumes of data • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
[Figure: depth of data analysis. Surface: SQL tools for simple queries and reporting. Shallow: statistical and OLAP tools for summaries and analysis. Hidden: data mining methods for knowledge discovery. Other labels: Bottom-Up Methodology, Predictive Analytics, Analytical Tools, Top-Down Methodology, Data.]
DM Enables Predictive Analytics [Figure: chart of reporting and analysis stages. Role of software: Passive, Interactive, Proactive. Business insight: Presentation, Exploration, Discovery. Stages: canned reporting, ad-hoc reporting, OLAP, data mining, predictive analytics.]
What Is Data Mining? • Data mining: • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Data Mining is NOT • Data Warehousing • (Deductive) query processing • SQL/Reporting • Software Agents • Expert Systems • Online Analytical Processing (OLAP) • Statistical Analysis Tool • Data visualization
Machine Learning Techniques • Technical basis for data mining: algorithms for acquiring structural descriptions from examples • Methods originate from artificial intelligence, statistics, and research on databases
Multidisciplinary Field [Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Artificial Intelligence, and Other Disciplines]
Multidisciplinary Field • Database technology • Artificial Intelligence • Machine Learning including Neural Networks • Statistics • Pattern recognition • Knowledge-based systems/acquisition • High-performance computing • Data visualization
History • Emerged in the late 1980s • Flourished in the 1990s • Roots traced back along three family lines • Classical Statistics • Artificial Intelligence • Machine Learning
Statistics • Foundation of most DM technologies • Regression analysis, standard distributions, standard deviation, variance, cluster analysis, confidence intervals • Building blocks • Plays a significant role in today’s data mining, but alone is not powerful enough
Artificial Intelligence • Heuristics vs. Statistics • Human-thought-like processing • Requires vast computer processing power • Supercomputers
Machine Learning • Union of statistics and AI • Blends AI heuristics with advanced statistical analysis • Lets computer programs learn about the data they study and make different decisions based on the quality of that data • Uses statistics for fundamental concepts and adds more advanced AI heuristics and algorithms
Data Mining • Application of machine learning techniques to real-world problems • Union: Statistics, AI, Machine Learning • Used to find previously hidden trends or patterns • Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends that could not be found otherwise
Terminology • Gold Mining • Knowledge mining from databases • Knowledge extraction • Data/pattern analysis • Knowledge Discovery in Databases (KDD) • Information harvesting • Business intelligence
What Does Data Mining Do? • Explores your data • Finds patterns • Performs predictions
What can we do with Data Mining? • Exploratory Data Analysis • Predictive Modeling: Classification and Regression • Descriptive Modeling • Cluster analysis/segmentation • Discovering Patterns and Rules • Association/Dependency rules • Sequential patterns • Temporal sequences • Deviation detection
CRISP-DM: Cross-Industry Standard Process for Data Mining [Figure: CRISP-DM process model]
Data Mining Applications • Science: Chemistry, Physics, Medicine • Biochemical analysis, remote sensors on a satellite, telescopes (star/galaxy classification), medical image analysis • Bioscience • Sequence-based analysis, protein structure and function prediction, protein family classification, microarray gene expression • Pharmaceutical companies, Insurance and Health care, Medicine • Drug development, identifying successful medical therapies, claims analysis, fraudulent behavior, medical diagnostic tools, predicting office visits • Financial Industry, Banks, Businesses, E-commerce • Stock and investment analysis, identifying loyal vs. risky customers, predicting customer spending, risk management, sales forecasting
Data Mining Applications • Market analysis and management • Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation (Credit Card scoring, Personalization & Customer Profiling) • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis (Banking and Credit Card scoring) • Fraud detection and management • Fraud, waste and abuse including: illegal practices, waste, payment error, non-compliance, incorrect billing practices
Data Mining Applications • Sports and Entertainment • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Campaign Management and Database Marketing
Data Mining Tasks • Concept/Class description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions; “normal” vs. fraudulent behavior • Association (correlation and causality) • Multi-dimensional interactions and associations • age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”) • Hospital(area code) ^ procedure(X) → claim(type) ^ claim(cost)
Data Mining Tasks • Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • Example: classify countries based on climate, cars based on gas mileage, or fraud based on claims information • Presentation: • IF-THEN rules, decision tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values
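An association rule such as age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”) is usually judged by its support and confidence. The slide does not define these measures, so the following is only a minimal sketch using their standard definitions; the `records` data and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical customer records, each encoded as a set of attribute=value items.
records = [
    {"age=20-29", "income=60-90K", "buys=TV"},
    {"age=20-29", "income=60-90K", "buys=TV"},
    {"age=20-29", "income=60-90K"},
    {"age=30-39", "income=60-90K", "buys=TV"},
    {"age=20-29", "income=30-60K"},
]

antecedent = {"age=20-29", "income=60-90K"}
consequent = {"buys=TV"}

rule_support = support(records, antecedent | consequent)
rule_confidence = confidence(records, antecedent, consequent)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here the rule holds in 2 of 5 records (support 0.40) and in 2 of the 3 records that match the antecedent (confidence about 0.67).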
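As one possible illustration of classifying cars by gas mileage with an IF-THEN rule, the sketch below learns a single threshold (a one-level decision tree, often called a decision stump). The mileage values, the labels, and the function names are assumptions made for this example, and a stump is far simpler than the tree learners named elsewhere in these slides.

```python
def fit_stump(values, labels):
    """Pick the threshold and the two output labels that classify
    the most training examples correctly."""
    classes = sorted(set(labels))
    xs = sorted(set(values))
    best = (-1, None, None, None)  # (correct, threshold, below, above)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2  # candidate split between neighbours
        for below in classes:
            for above in classes:
                correct = sum(
                    (below if v <= threshold else above) == y
                    for v, y in zip(values, labels)
                )
                if correct > best[0]:
                    best = (correct, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def predict(stump, value):
    """Apply the learned rule: IF value <= threshold THEN below ELSE above."""
    threshold, below, above = stump
    return below if value <= threshold else above


# Hypothetical training data: miles per gallon and a mileage class.
mpg = [12, 15, 18, 30, 35, 40]
mileage_class = ["low", "low", "low", "high", "high", "high"]

stump = fit_stump(mpg, mileage_class)
print(f"IF mpg <= {stump[0]} THEN {stump[1]} ELSE {stump[2]}")
print(predict(stump, 20), predict(stump, 28))
```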
Data Mining Tasks • Cluster analysis • Class label is unknown: Group data to form new classes • Example: cluster claims or providers to find distribution patterns of unusual behavior • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
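The clustering principle above can be sketched with K-means (Lloyd's algorithm), one of the clustering methods listed later in these slides. This is a minimal illustration, not a production implementation: the points and the hand-picked starting centroids are assumptions, and real tools choose the starting centroids more carefully.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members
            else centroids[i]  # keep an empty cluster's centroid in place
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No class labels are supplied; the two groups emerge from the distances alone, which is what distinguishes clustering from classification.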
Data Mining Tasks • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • Usually treated as noise or an exception, but quite useful in fraud detection and rare-event analysis • Trend and evolution analysis • Trend and deviation: regression analysis • Sequential pattern mining, periodicity analysis
KDD Process [Figure: KDD process diagram. Labels: Database; Selection; Data Preparation; Transformation; Training Data; Data Mining; Model, Patterns; Evaluation, Verification]
Learning and Modeling Methods • Decision Tree Induction (C4.5, J48) • Regression Tree Induction (CART, M5P) • Multivariate Adaptive Regression Splines (MARS) • Clustering (K-means, EM, Cobweb) • Artificial Neural Networks (Backpropagation, Recurrent) • Support Vector Machines (SVM) • Various other models
TAXONOMY • Predictive Methods • Use some variables to predict some unknown or future values of other variables • Descriptive Methods • Find human-interpretable patterns that describe the data • Supervised vs. Unsupervised • Labeled vs. unlabeled data
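One simple way to flag a data object that does not comply with the general behavior of the data is a z-score test. The slides do not prescribe a method, so this is only a sketch: the claim amounts and the threshold of 2 standard deviations are assumptions chosen for illustration.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population
    standard deviations away from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical claim amounts; one is far larger than the rest.
claims = [100, 102, 98, 101, 99, 103, 97, 500]
outliers = zscore_outliers(claims)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a median-based measure may be a more robust choice.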
Data Mining Challenges • Computationally expensive to investigate all possibilities • Dealing with noise/missing information and errors in data • Mining methodology and user interaction • Mining different kinds of knowledge in databases • Incorporation of background knowledge • Handling noise and incomplete data • Pattern evaluation: the interestingness problem • Expression and visualization of data mining results
Data Mining Heuristics and Guide • Choosing appropriate attributes/input representation • Finding the minimal attribute space • Finding adequate evaluation function(s) • Extracting meaningful information • Model evaluation accuracy vs. overfitting
Available Data Mining Tools COTS: • IBM Intelligent Miner • SAS Enterprise Miner • Oracle ODM • MicroStrategy • Microsoft DBMiner • Pentaho • MATLAB • Teradata Open Source: • WEKA • KNIME • Orange • RapidMiner • NLTK • R • Rattle
Data Mining Trends • Many data mining problems involve large, complex databases, complicated modeling techniques and substantial computer processing • Scalability is very important due to the rapid growth in the amount of data and the need to build and deploy models at rapid rates
Needs for Scalable High Performance Data Mining • Driven by the size of the database and by the effort of building, testing, validating and deploying a data mining solution • Taking advantage of parallel database and file systems and additional CPUs • Work with more data, build more models, and improve their accuracy by simply adding CPUs • Build a good data mining model as quickly as possible!
Scalable High Performance Data Mining • One solution: run parallel data mining algorithms and parallel DBMSs on parallel hardware • Once the model is built, parallel computing is still needed to apply the model to a large amount of data • Scoring: applying a data mining model to a batch of data, or to records or events one at a time, also requires a scalable deployment
Data mining applications on Gordon • DM suites • Computational packages with DM tools (MathWorks, Octave) • Libraries for building tools • Others as requested
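Batch scoring of a finished model can be sketched as follows. This is a toy illustration only: `score_record` is a hypothetical stand-in for a real trained model, and a thread pool is used to show the batching pattern. CPU-bound scoring on real parallel hardware would more likely use separate processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def score_record(record):
    """Hypothetical model: flag a claim whose amount exceeds 1000."""
    return record["amount"] > 1000


def batches(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def score_batch(batch):
    """Score every record in one batch."""
    return [score_record(r) for r in batch]


def score_all(records, batch_size=2, workers=4):
    """Score batches concurrently; map() preserves the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = pool.map(score_batch, batches(records, batch_size))
        return [flag for batch in scored for flag in batch]


# Hypothetical claims to score.
claims = [{"amount": a} for a in (250, 4000, 900, 1500, 12000)]
flags = score_all(claims)
print(flags)
```

Because each batch is scored independently, adding workers increases throughput without changing the results.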
Summary • Discovering interesting patterns from large amounts of data • CRISP-DM Industry standard • Learn from the past • High quality, evidence based decisions • Predict for the future • Prevent future instances of fraud, waste & abuse • React to changing circumstances • Current models, continuous learning
Thank you! Questions: natashab@sdsc.edu
An Introduction to the Gordon Architecture San Diego Supercomputer Center
Gordon is not about FLOPS, but … A conservative estimate puts Gordon in the top 50 on the Top 500 list
Access to Big Data Comes with a Latency Penalty [Figure: memory and storage hierarchy, with axis annotations “lower is better” and “higher is better”. Levels: L1 Cache (KB); L2 Cache (KB); L3 Cache (MB); DDR3 Memory (10’s of GB); Quick Path Interconnect (10’s of GB); QDR InfiniBand Interconnect (100’s of GB); 64 I/O nodes, 300 TB Intel SSD; Data Oasis Lustre 4 PB PFS]
Flash Drives are a Good Fit for Data Intensive Computing • Apart from the differences between HDD and SSD, it is not common to find local storage “close” to the compute. We have found this to be attractive in our Trestles cluster, which has local flash on the compute, but is used for traditional HPC applications (not high IOPS).
Gordon System Specification * May be revised
Gordon Applications Software • chemistry: adf, amber, gamess, gaussian, gromacs, lammps, namd, nwchem • visualization: idl, NCL, paraview, tecplot, visit, VTK • genomics: abyss, blast, hmmer, soapdenovo, velvet • data mining: IntelligentMiner, RapidMiner, RATTLE, Weka • libraries: ATLAS, BLACS, fftw, HDF5, Hypre, SPRNG, superLU • distributed computing: globus, Hadoop, MapReduce • compilers/languages: gcc, intel, pgi; MATLAB, Octave, R; PGAS (UPC) • DB2, PostgreSQL
MapReduce Paradigmatic Example: string counting • Scheduler: manages threads, initiates the data split, and calls Map • Map: counts strings; outputs key = string, value = count • Scheduler: re-partitions keys and values • Reduce: sums up counts • MR provides parallelization, concurrency, and intermediate data functions (by key and value) • The user defines the keys and values and supplies the Map and Reduce functions
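The string-counting example can be sketched in a single process as follows. This is a minimal illustration of the data flow only: the function names are made up for this sketch, and a real framework such as Hadoop would run the Map and Reduce calls in parallel across many nodes.

```python
from collections import defaultdict


def split_data(lines, n_chunks):
    """Scheduler step: split the input into n_chunks pieces."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_count(chunk):
    """Map step: emit (key=string, value=count) pairs for one chunk."""
    local = defaultdict(int)
    for line in chunk:
        for word in line.split():
            local[word] += 1
    return list(local.items())


def shuffle(mapped_chunks):
    """Scheduler step: re-partition the pairs so that all values
    for the same key end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_sum(key, values):
    """Reduce step: sum up the counts for one key."""
    return key, sum(values)


def map_reduce(lines, n_chunks=2):
    chunks = split_data(lines, n_chunks)
    mapped = [map_count(c) for c in chunks]  # parallel on a real system
    grouped = shuffle(mapped)
    return dict(reduce_sum(k, v) for k, v in grouped.items())


counts = map_reduce(["data mining", "big data", "data mining"])
print(counts)
```

Only `map_count` and `reduce_sum` are specific to string counting; the split and shuffle steps stay the same for other key and value choices.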