680 likes | 905 Views
Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 1 —. 1. Chapter 1. Introduction. Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used?
E N D
Data Mining: Concepts and Techniques(3rd ed.)— Chapter 1 — 1
Chapter 1. Introduction • Why Data Mining? • What Is Data Mining? • A Multi-Dimensional View of Data Mining • What Kind of Data Can Be Mined? • What Kinds of Patterns Can Be Mined? • What Technology Are Used? • What Kind of Applications Are Targeted? • Major Issues in Data Mining • A Brief History of Data Mining and Data Mining Society • Summary
Why Data Mining? • The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube
Why Data Mining? • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets • Data mining will move the data age toward the coming information age. Data Mining: Concepts and Techniques
Example: Google Flu Trends • Google’s Flu Trends uses specific search terms as indicators of flu activity. It found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms. A pattern emerges when all of the search queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster than traditional systems can. • This example shows how data mining can turn a large collection of data into knowledge that can help meet a current global challenge. Data Mining: Concepts and Techniques
Evolution of Sciences • Before 1600, empirical science • 1600-1950s, theoretical science • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. • 1950s-1990s, computational science • Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.) • Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models. • 1990-now, data science • The flood of data from new scientific instruments and simulations • The ability to economically store and manage petabytes of data online • The Internet and computing Grid that makes all these archives universally accessible • Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge! • Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) • Application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: • Data mining, data warehousing, multimedia databases, and Web databases • 2000s • Stream data management and mining • Data mining and its applications • Web technology (XML, data integration) and global information systems
Data Mining turn data tombs to golden nuggets of knowledge Data Mining: Concepts and Techniques
Chapter 1. Introduction • Why Data Mining? • What Is Data Mining? • A Multi-Dimensional View of Data Mining • What Kind of Data Can Be Mined? • What Kinds of Patterns Can Be Mined? • What Technology Are Used? • What Kind of Applications Are Targeted? • Major Issues in Data Mining • A Brief History of Data Mining and Data Mining Society • Summary
Data mining—searching for knowledge (interesting patterns) in data. Data Mining: Concepts and Techniques
What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial,implicit, previously unknown and potentially useful)patterns or knowledge from huge amount of data • Data mining: a misnomer? • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: Is everything “data mining”? • Simple search and query processing • (Deductive) expert systems
Data Mining turn data tombs to golden nuggets of knowledge Data Mining: Concepts and Techniques
Knowledge Discovery (KDD) Process Knowledge • This is a view from typical database systems and data warehousing communities • Data mining plays an essential role in the knowledge discovery process Pattern Evaluation Data Mining Task-relevant Data Selection Data Warehouse Data Cleaning Data Integration Databases
Knowledge Discovery (KDD) Process • 1. Data cleaning (to remove noise and inconsistent data) • 2. Data integration (where multiple data sources may be combined) • 3. Data selection (where data relevant to the analysis task are retrieved from the database) • 4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations) • 5. Data mining (an essential process where intelligent methods are applied to extract data patterns) • 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures—see Section 1.4.6) • 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users) Data Mining: Concepts and Techniques
Example: A Web Mining Framework • Web mining usually involves • Data cleaning • Data integration from multiple sources • Warehousing the data • Data cube construction • Data selection for data mining • Data mining • Presentation of the mining results • Patterns and knowledge to be used or stored into knowledge-base
Data Mining in Business Intelligence Increasing potential to support business decisions End User DecisionMaking Business Analyst Data Presentation Visualization Techniques Data Mining Data Analyst Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses DBA Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
Example: Mining vs. Data Exploration • Business intelligence view • Warehouse, data cube, reporting but not much mining • Business objects vs. data mining tools • Supply chain example: tools • Data presentation • Exploration
Data integration Normalization Feature selection Dimension reduction Pattern evaluation Pattern selection Pattern interpretation Pattern visualization KDD Process: A Typical View from ML and Statistics • This is a view from typical machine learning and statistics communities Pattern Information Knowledge Data Mining Post-Processing Data Pre-Processing Input Data Pattern discovery Association & correlation Classification Clustering Outlier analysis … … … …
Example: Medical Data Mining • Health care & medical data mining – often adopted such a view in statistics and machine learning • Preprocessing of the data (including feature extraction and dimension reduction) • Classification or/and clustering processes • Post-processing for presentation
Chapter 1. Introduction • Why Data Mining? • What Is Data Mining? • A Multi-Dimensional View of Data Mining • What Kind of Data Can Be Mined? • What Kinds of Patterns Can Be Mined? • What Technology Are Used? • What Kind of Applications Are Targeted? • Major Issues in Data Mining • A Brief History of Data Mining and Data Mining Society • Summary
Multi-Dimensional View of Data Mining • Data to be mined • Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks • Knowledge to be mined (or: Data mining functions) • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. • Descriptive vs. predictive data mining • Multiple/integrated functions and mining at multiple levels • Techniques utilized • Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc. • Applications adapted • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Chapter 1. Introduction • Why Data Mining? • What Is Data Mining? • A Multi-Dimensional View of Data Mining • What Kind of Data Can Be Mined? • What Kinds of Patterns Can Be Mined? • What Technology Are Used? • What Kind of Applications Are Targeted? • Major Issues in Data Mining • A Brief History of Data Mining and Data Mining Society • Summary
Data Mining: On What Kinds of Data? • Database-oriented data sets and applications • Relational database, data warehouse, transactional database • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bio-sequences) • Structure data, graphs, social networks and multi-linked data • Object-relational databases • Heterogeneous databases and legacy databases • Spatial data and spatiotemporal data • Multimedia database • Text databases • The World-Wide Web
Mining database data • A relational database for AllElectronics. Relational data can be accessed by database queries written in a relational query language (e.g., SQL) Mining relational databases, searching for trends or data patterns. Data Mining: Concepts and Techniques
Example cont. • Data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information. • Data mining systems may also detect deviations—that is, items with sales that are far from those expected in comparison with the previous year. Such deviations can then be further investigated. • Data mining may discover that there has been a change in packaging of an item or a significant increase in price. Data Mining: Concepts and Techniques
Data Warehouses • A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data Mining: Concepts and Techniques
Data Cube • A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure. Data Mining: Concepts and Techniques
Transactional Data • A transactional database for AllElectronics. ask,“Which items sold well together?” mining frequent itemsets Data Mining: Concepts and Techniques
Chapter 1. Introduction • Why Data Mining? • What Is Data Mining? • A Multi-Dimensional View of Data Mining • What Kind of Data Can Be Mined? • What Kinds of Patterns Can Be Mined? • What Technology Are Used? • What Kind of Applications Are Targeted? • Major Issues in Data Mining • A Brief History of Data Mining and Data Mining Society • Summary
Data Mining Function • Characterization • Discrimination • Frequent patterns • Association • Correlation • Classification • Regression • Clustering analysis • Outlier analysis Data Mining: Concepts and Techniques
Data Mining Function • Descriptive and predictive. • Descriptive mining tasks characterize properties of the data in a target data set. • Predictive mining tasks perform induction on the current data in order to make predictions. Data Mining: Concepts and Techniques
Class/Concept Description: Characterizationand Discrimination • Data entries can be associated with classes or concepts. • For example, in the AllElectronics store, • classes of items for sale include computers and printers, and • concepts of customers include bigSpenders and budgetSpenders. • It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. • Such descriptions of a class or a concept are called class/concept descriptions. Data Mining: Concepts and Techniques
Class/Concept Description • These descriptions can be derived using • (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or • (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or • (3) both data characterization and discrimination. Data Mining: Concepts and Techniques
Data characterization • Data characterization is a summarization of the general characteristics or features of a target class of data. • The output of data characterization can be presented in • pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs. • The resulting descriptions can also be presented as generalized relations or in rule form (called characteristic rules). Data Mining: Concepts and Techniques
Data discrimination • Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes. Data Mining: Concepts and Techniques
Mining Frequent Patterns, Associations, and Correlations • where X is a variable representing a customer. • A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. • A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together. Data Mining: Concepts and Techniques
Mining Frequent Patterns, Associations, and Correlations • Single-dimensional association rules. • Multidimensional association rule. Data Mining: Concepts and Techniques
Classification and Regression for Predictive Analysis • Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. • The model are derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). • The model is used to predict the class label of objects for which the class label is unknown. • “How is the derived model presented?” The derived model may be represented in various forms Data Mining: Concepts and Techniques
Classification Rule Data Mining: Concepts and Techniques
Decision Tree Data Mining: Concepts and Techniques
Neural network Data Mining: Concepts and Techniques
Classification and Regression for Predictive Analysis • Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions. That is, regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels. • The term prediction refers to both numeric prediction and class label prediction. • Regression analysis is a statistical methodology that is most often used for numeric prediction. Regression also encompasses the identification of distribution trends based on the available data. Data Mining: Concepts and Techniques
Example • Suppose instead, that rather than predicting categorical response labels for each store item, you would like to predict the amount of revenue that each item will generate during an upcoming sale at AllElectronics, based on the previous sales data. • This is an example of regression analysis because the regression model constructed will predict a continuous function (or ordered value.) Data Mining: Concepts and Techniques
Cluster Analysis • clustering analyzes data objects without consulting class labels. • Clustering can be used to generate class labels for a group of data. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Data Mining: Concepts and Techniques
Example • Cluster analysis can be performed on AllElectronics customer data to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. Figure 1.10 shows a 2-D plot of customers with respect to customer locations in a city. Three clusters of data points are evident. Data Mining: Concepts and Techniques
Outlier Analysis • A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers. • Many data mining methods discard outliers as noise or exceptions. • However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones. • Outliers may be detected using • statistical tests that assume a distribution or probability model for the data, or • using distance measures where objects that are remote from any other cluster are considered outliers. • density-based methods may identify outliers in a local region, although they look normal from a global statistical distribution view. Data Mining: Concepts and Techniques
Interesting Pattern • a pattern is interesting if it is • (1) easily understood by humans, • (2) valid on new or test data with some degree of certainty, • (3) potentially useful, and • (4) novel. • A pattern is also interesting if it validates a hypothesis that the user sought to confirm. • An interesting pattern represents knowledge. Data Mining: Concepts and Techniques
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis • Sequence, trend and evolution analysis • Trend, time-series, and deviation analysis: e.g., regression and value prediction • Sequential pattern mining • e.g., first buy digital camera, then buy large SD memory cards • Periodicity analysis • Motifs and biological sequence analysis • Approximate and consecutive motifs • Similarity-based analysis • Mining data streams • Ordered, time-varying, potentially infinite, data streams
Structure and Network Analysis • Graph mining • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) • Information network analysis • Social networks: actors (objects, nodes) and relationships (edges) • e.g., author networks in CS, terrorist networks • Multiple heterogeneous networks • A person could be multiple information networks: friends, family, classmates, … • Links carry a lot of semantic information: Link mining • Web mining • Web is a big information network: from PageRank to Google • Analysis of Web information networks • Web community discovery, opinion mining, usage mining, …