A Brief Overview of Data Mining - IR Group Meeting 04/11/2006 Qiaozhu Mei
Outline • Introduction • Functionalities • Hot topics • Research Groups • Useful Resources
Part 1: Introduction • Introduction • What is data mining? • General Process • Related Fields • Different Views • Functionalities • Hot topics • Research Groups • Useful Resources
What is Data Mining? • (From Prof. Jiawei Han’s slides): Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data • (From Prof. Sunita Sarawagi’s slides): Process of semi-automatically analyzing large databases to find patterns that are • valid: hold on new data with some certainty • novel: non-obvious to the system • useful: should be possible to act on the item • understandable: humans should be able to interpret the pattern • (From Prof. Vipin Kumar’s slides): Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is Data Mining? (cont.) • Under these definitions: • What is not Data Mining? • Look up a phone number in a phone directory • Query a Web search engine for information about “Amazon” • What is Data Mining? • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area) • Group together similar documents returned by a search engine according to their context - Tan, Steinbach, Kumar, Introduction to Data Mining
General Process of KDD • Data mining: core of the knowledge discovery process • [Figure: KDD pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge] - Han & Kamber, Data Mining: Concepts and Techniques
Related Fields • Confluence of Multiple Disciplines • [Figure: Data Mining at the center of Database Technology, Statistics, Machine Learning, Visualization, Algorithm, Other Disciplines] - Han & Kamber, Data Mining: Concepts and Techniques • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • But different… - Tan, Steinbach, Kumar, Introduction to Data Mining
Differences from Related Fields • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data • Overlaps with machine learning, statistics, artificial intelligence, databases, visualization, but with more stress on • scalability in the number of features and instances • algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning • automation for handling large, heterogeneous data • From Prof. Vipin Kumar’s slides • From Prof. Sunita Sarawagi’s slides
Different Views of Data Mining • Categorize a data mining task from different views • By general functionality and operations: • Descriptive data mining • Find human-interpretable patterns that describe the data. • Clustering / similarity matching • Association rules and variants • Deviation detection • Predictive data mining • Use some variables to predict unknown or future values of other variables. • Regression • Classification • Collaborative Filtering
Different Views of Data Mining (II) • By data to be mined • Relational, data warehouse, transactional, stream, object-oriented, sequence, graph, social network, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW • By knowledge to be discovered • Characterization, discrimination, frequent patterns, association, classification, clustering, trend/deviation, outlier analysis, etc • By techniques utilized • Database-oriented, data warehouse (OLAP), combinational algorithms, machine learning, statistics, visualization, etc. • By application adapted • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. - Han & Kamber, Data Mining: Concepts and Techniques
Part 2: Functionalities • Introduction • Functionalities • Data Warehousing and OLAP • Frequent patterns, association, correlation and causality • Classification and prediction • Clustering • Outlier analysis, Trend and evolution analysis • Hot topics • Research Groups • Useful Resources
Data Warehousing and OLAP • Data Warehousing: • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” - W. H. Inmon • OLAP: on-line analytical processing • Major task of data warehouse system • Data analysis and decision making • Drill-down, roll-up, exception/discovery driven • Methodology • Data Cubing • Iceberg cube • Multi-way, BUC, Star, MM, shell, close-cube, etc. • [Figure: lattice of cuboids over the dimensions product, date, country: all; country; product; date; product,date; product,country; date,country; product,date,country] - Han & Kamber, Data Mining: Concepts and Techniques
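As a rough illustration of data cubing, the sketch below enumerates every cuboid (group-by) over the three dimensions product, date, and country, then aggregates a tiny fact table at each level. The rows, the `sales` measure, and the function names `cuboids` and `aggregate` are hypothetical, chosen only for this example; real cubing methods such as Multi-way or BUC avoid this naive full recomputation.

```python
from collections import defaultdict
from itertools import combinations

def cuboids(dimensions):
    """List every group-by of the cube, from the apex () to the base cuboid."""
    return [combo
            for size in range(len(dimensions) + 1)
            for combo in combinations(dimensions, size)]

def aggregate(rows, dims, measure="sales"):
    """Roll the fact rows up to the given dimensions by summing the measure."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

# Hypothetical fact table, for illustration only.
rows = [
    {"product": "tv", "date": "2006-04", "country": "US", "sales": 3},
    {"product": "tv", "date": "2006-04", "country": "CA", "sales": 2},
    {"product": "pc", "date": "2006-05", "country": "US", "sales": 5},
]

lattice = cuboids(("product", "date", "country"))
cube = {dims: aggregate(rows, dims) for dims in lattice}
```

Here `cube[()]` is the fully rolled-up "all" cuboid, and moving from `cube[("product",)]` to `cube[("product", "country")]` corresponds to a drill-down.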
Frequent Patterns and Associations • Frequent pattern: a pattern (itemsets, subsequences, substructures, etc.) that occurs frequently in a data set • Compare with n-grams, phrases, etc. • Motivation: Finding inherent regularities in data • Applications: Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis • Association rule mining: • Given a set of records, each of which contains some number of items from a given collection; • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items. • Frequent patterns → association rules → correlations
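A dependency rule of the kind described above is usually scored by support and confidence. The following minimal sketch computes both over a toy basket data set; the transactions and the function names are made up for illustration and are not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
```

On this data the rule diaper → beer holds in 3 of 5 baskets, and in 3 of the 4 baskets that contain diapers.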
Mining Frequent Patterns • Types of data: • Itemsets, sequences, graphs. • Scalable mining methods: Three major approaches • Apriori (Agrawal & Srikant @VLDB’94) • FPgrowth (Han, Pei & Yin @SIGMOD’00); pattern-growth variants: PrefixSpan, CloSpan, gSpan, CloseGraph, etc. • Vertical data format approach (Charm, Zaki & Hsiao @SDM’02) • Apriori: • Candidate pattern generation and pruning • Breadth-first search over pattern space • FPgrowth: • Pattern growth through FP-tree, no candidate generation • Depth-first search, doing pruning smartly
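The Apriori idea (level-wise candidate generation, then pruning by the rule that every subset of a frequent itemset must be frequent) can be sketched as follows. This is a simplified teaching version with an absolute `min_support` count, not the optimized algorithm from the VLDB’94 paper; the data and names are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:  # breadth-first: one pass per itemset size
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]
frequent_itemsets = apriori(transactions, min_support=3)
```

FPgrowth reaches the same set of frequent itemsets without generating candidates, by recursively growing patterns from a compressed FP-tree.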
Classification and Prediction • Supervised Learning, already discussed in Machine Learning. • Classification: classifies data (constructs a model) based on the training set and the values (categorical class labels) in a classifying attribute and uses it in classifying new data • Prediction: models continuous-valued functions, i.e., predicts unknown or missing values • Algorithms: • Decision Tree based: C4.5, ID3, RainForest, etc. • Bayesian Method: Naïve Bayesian, Bayesian network, and many others covered in Machine Learning • Discriminative: Perceptron/Winnow, NN, SVM, CB-SVM, etc. • Rule-based, Associative, k-NN, etc. • Prediction: Regression • Bagging, Boosting, Model Selection, Cross-Validation
Clustering • Unsupervised Learning, as discussed in Machine Learning • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that • Data points in one cluster are more similar to one another. • Data points in separate clusters are less similar to one another. • Similarities/distances: many! • Algorithms: • Partition based: K-means, K-Medoids, CLARA, etc • Hierarchical: Bottom-up (single/complete/average link), top-down, Birch • Density-based/Grid-based: DBSCAN, DENCLUE, CLIQUE, etc. • Model-based: EM, COBWEB, SOM, etc. • High-Dimensional, Constraint based
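As one concrete instance of the partition-based family, here is a minimal K-means sketch using squared Euclidean distance. The initial centroids are passed in explicitly so the run is deterministic; the points and the function name are hypothetical, and practical implementations add smarter initialization and a convergence test.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to each centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```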
Outlier, Trend and Evolution • outliers: The set of objects that are considerably dissimilar from the remainder of the data • Statistical: hypothesis testing, bug mining • Density based • Clustering based, etc • Deviation/Anomaly Detection • Fraud Detection • Trend and Evolution: • Usually coupled with outlier analysis • Basic functionalities in temporal data mining • Trend, cycle, seasonal, irregular patterns
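A very simple statistical take on outlier detection is to flag values that lie more than a chosen number of standard deviations from the mean. The sketch below assumes roughly normal data and a threshold of 2.0, both of which are illustrative choices rather than anything prescribed in the slides.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements with one obvious deviation.
values = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(values)
```

Because the outlier itself inflates the mean and standard deviation, density-based or clustering-based methods are often preferred when several outliers are present.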
Part 3: Hot Topics • Introduction • Functionalities • Hot topics • Mining data stream, Mining time series, Spatiotemporal data mining, mining Social Networks, Sequential data mining, Graph Mining, Biology data mining, Privacy Preserving Data Mining • Text and Web mining • Research Groups • Useful Resources
Mining Data Streams • Data: Data streams—continuous, ordered, changing fast, huge amount • Characteristics and Challenges : • Huge volumes • Fast changing, requires fast and real-time response • Random access is expensive — need single scan algorithms • Difficult to keep the universe — need approximations • Basic problems: • Multi-dimensional on-line analysis of streams • Mining outliers and unusual patterns in stream data • Clustering data streams • Classification of stream data
Mining Data Streams (II) • Methods: • Basic: Sliding windows, Tilted time frames • Counting (FP mining, etc): • Random sampling • Approximate counting • OLAP: • Keep critical layers in stream cube computation • Partial materialization • Outlier: exception-based exploration • Clustering: • Online microclustering and offline macroclustering • Text Related Applications: • Web logs and Web page click streams
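Random sampling over a stream has to work in a single scan without knowing the stream length in advance. Reservoir sampling is one standard way to do this; the sketch below is a textbook version (Algorithm R), offered as an example of the idea rather than as the specific method the slides had in mind. The `rng` parameter is an assumption added to make runs reproducible.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream seen only once."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir

sample = reservoir_sample(iter(range(1000)), k=5, rng=random.Random(0))
```

Memory use stays at k items no matter how long the stream runs, which matches the single-scan and approximation constraints of stream mining.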
Mining Time series • Data: Time-series database • Consists of sequences of values or events changing with time • Data is recorded at regular intervals • Characteristics and Challenges : • Characteristic time-series components: Trend, cycle, seasonal, irregular patterns • Basic Problems: • Trends discovery, Similarity Search, outlier detection, prediction and clustering
Mining Time series (II) • Methods: • Statistical modeling (Regression, Spline, Mixture Model, etc) • Data transformation (DFT, DWT) • Sliding windows, Atomic matching, window stitching, Subsequence Ordering • Clustering • Text Related Applications: • Transliteration mining, Temporal text mining, word bursting, etc. - Han & Kamber, Data Mining: Concepts and Techniques
Spatiotemporal data mining • Data: object data sets, spatial/spatiotemporal databases and data warehouses • Characteristics and Challenges: • Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage • handling objects in space that have identity and well-defined extents, locations, and relationships. • Require the merge of a set of geographic areas by spatial operations • Basic Problems: • Querying objects; distribution/cluster/correlation/evolution/trend analysis
Spatiotemporal data mining (II) • Methods • GIS (Geographic Information System): Analysis and visualization of geographic data • Search, Location analysis, Terrain analysis, Distribution, Spatial analysis/statistics, Measurement • Indexing spatial data (R-tree, etc.) • Modeling single objects with points, lines and regions • Modeling spatially related collections of objects: plane partitions and networks. • Spatiotemporal patterns, correlations, trend analysis, clustering… • Text Related Applications: • Spatiotemporal text mining; community evolution in weblogs; • Information diffusion; web evolution
Special topics in Frequent Pattern Mining • Association rule mining and frequent itemset mining are pretty old topics • However, some special topics of frequent pattern mining are still hot • Sequential pattern mining • Graph mining • Pattern post-processing
Sequential pattern mining • Data: sequential database • Basic problems: • Discovery of frequent subsequences (gaps allowed, compare with n-grams); closed subsequences • Sequence Similarity Search, Sequence Alignment • Methods: • Apriori: GSP • FP-Growth: PrefixSpan, CloSpan • BLAST, Hidden Markov models, CRF, etc. • Text Related Applications: • Most text patterns are sequential patterns • Phrase extraction, entity/relation extraction, opinion mining, etc • Biology sequence modeling - Han & Kamber, Data Mining: Concepts and Techniques
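The difference between a sequential pattern and an n-gram is that the pattern’s items must appear in order but may be separated by gaps. A minimal sketch of counting the support of such a pattern in a sequence database follows; the toy sequences and function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items occur in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator past the first match,
    # so later items can only match later positions.
    return all(item in remaining for item in pattern)

def sequence_support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))

# Hypothetical sequence database; each string is one sequence of items.
database = ["abcd", "acbd", "bd", "adc"]

support_ad = sequence_support("ad", database)
support_acd = sequence_support("acd", database)
```

The pattern "ad" is supported by three sequences here even though it appears as a contiguous bigram in only one of them ("adc").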
Graph Mining • Data: graph databases (like social network, but multiple graphs, more general), examples include • Chemical component, protein structure, program flow, XML/Web • Directed, undirected, labeled/unlabeled, weighted, 2-D/3-D, etc. • Characteristics and Challenges: • Theoretically, most are of high complexity, but practically, the graphs are solvable. • Too many substructures to index • … • Basic problems • Frequent subgraph mining • Closed subgraph mining • Graph indexing by substructures • Similarity search - Han & Kamber, Data Mining: Concepts and Techniques
Graph Mining (II) • Methods: • Subgraph mining: Apriori (e.g. FSG), Pattern Growth (e.g. gSpan) • gSpan: pattern growth, depth-first search, active elimination of duplicated subgraphs; flatten a graph into a sequence using depth-first search; enumerate graphs using right-most extension. • CloseGraph: mining closed subgraph patterns • gIndex: identify frequent structures, prune redundancy to maintain discriminative structures, create index on such structures. • Similarity search: indexing; feature-based similarities; estimating missing features • Text Related Applications: • Multi-resolution topic map, entity-relation network, pathway extraction, etc.
Graph Mining (III): Graph Indexing & Querying • More on Graph Indexing and Similarity Search • Comparing to Text Retrieval:
Graph Mining (IV): Graph Indexing & Querying • What if we want to index on phrases instead of words? • Need to extract phrases first • N-grams/sequential patterns, have to remove redundancy • E.g. “natural language processing” vs. “language processing” • Substructures are like phrases… • Can IR help? • Representation and Similarity measures? (Vector Space Models, Probabilistic models…) • How to weight features? (TF-IDF, …) • Generative models? • Query expansion? Feedback?
Pattern Post-processing • Data: frequent patterns extracted by mining algorithms • Challenge: • Mining algorithms output an explosively large number of patterns • How to interpret the frequent patterns extracted • Basic Problems: • Pattern summarization • Mining compressed patterns • Top-K patterns • Pattern annotation • User-oriented ranking • Methods: • Modeling pattern profiles, coverage and contexts • Using clustering to summarize and compress patterns • Bridging IR/NLP and frequent pattern mining: profile, context, ranking, feedback, filtering, summarization, MMR, etc.
Mining Social Networks • Data: Graphs/networks with nodes and links • Example: communication networks, webpages, citations, biological pathways, etc. • Characteristics and Challenges: • Connected Components: few • Network diameter: small • Clustering: high degree • Degree distribution: heavy-tailed • Modeling Logical/statistical dependencies • Basic Problems: • Model the generation of graphs/networks • Link based object ranking, classification, Identification, Clustering, entity resolution • Link Prediction, querying, community discovery - H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)
Mining Social Networks (II) • Methods: • Graph Generation Models: derive generative models that explain the characteristics and evolution of social networks/graphs. • Vertex Ranking: PageRank, HITS, etc. • Community Detection: Hierarchical Clustering, Spectral clustering, Stochastic modeling, etc. • Link based classification: semi-supervised learning, propagation • Entity resolution: duplicate prediction, collective resolution, probabilistic models • Link Prediction: binary classification problem, local conditional probabilistic models • Substructure mining: graph pattern mining, indexing
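To make vertex ranking concrete, here is a small power-iteration sketch of PageRank. The damping factor 0.85, the fixed iteration count, the even redistribution of rank from dangling nodes, and the four-node toy graph are all assumptions made for this example; it is not presented as the production algorithm of any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it points to (all must be keys)."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical link graph: D has no in-links, C collects the most.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
```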
Mining Social Networks (III) • Generative Models of social network/graph generation and evolution • Random graphs (Erdős-Rényi models) • Fix vertices, generate each edge independently with probability p • N(N-1)/2 trials of a biased coin flip, p ~ 1/N • Degree distribution is (approximately, for large N) Poisson, E[d] = p(N-1); E[# of edges] = pN(N-1)/2 • Parameter: p • Graph process model: • Start with no edges and keep adding one edge at a time • Always choose the next edge randomly from among all missing edges
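The random graph model above translates almost directly into code: one biased coin flip per vertex pair. The sketch below also includes the two expectations quoted on the slide; the function names and the optional `rng` argument are choices made for this example.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng=None):
    """Generate the edge list of a G(n, p) graph: each pair is linked with probability p."""
    rng = rng or random.Random()
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def expected_degree(n, p):
    """E[d] = p(N-1)."""
    return p * (n - 1)

def expected_edges(n, p):
    """E[# of edges] = pN(N-1)/2."""
    return p * n * (n - 1) / 2

edges = erdos_renyi(100, 0.05, rng=random.Random(42))
```

With N = 100 and p = 0.05 the expected edge count is 247.5, and a single sampled graph will usually land near that value.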
Mining Social Networks (IV) • α-model (Watts-Strogatz models, Small-world) • For vertices u, v, define m(u,v) to be the number of common neighbors (so far) • Define the propensity R(u,v) of u to connect to v • if m(u,v) >= k, R(u,v) = 1 (share too many friends, must connect) • if m(u,v) = 0, R(u,v) = p (no mutual friends, no bias to connect) • else, R(u,v) = p + (m(u,v)/k)^α (1-p) (biased to connect) • Generate network incrementally, with R(u,v) as the edge probability • α → ∞ is similar to Erdős-Rényi models • Need to tune parameters α, p, k
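The propensity rule can be written as a small function, assuming the exponent on m(u,v)/k is α (the reading under which the three cases agree at m = 0 and m = k). The parameter values below are arbitrary and serve only to show the limiting behaviour.

```python
def propensity(m, k, p, alpha):
    """Propensity R(u,v) given m common neighbours, threshold k, baseline p, exponent alpha."""
    if m >= k:
        return 1.0          # too many shared friends: must connect
    if m == 0:
        return p            # no mutual friends: baseline probability only
    return p + (m / k) ** alpha * (1.0 - p)

# Hypothetical settings: 2 common neighbours out of a threshold of 5.
mild = propensity(2, 5, p=0.01, alpha=2)
extreme = propensity(2, 5, p=0.01, alpha=200)
```

As α grows, the middle case collapses toward p for every m < k, so edges are formed almost independently of shared neighbours, which is why the model then resembles a random graph.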
Mining Social Networks (V) • Scale-free models: N (# of vertices) is not fixed • Start with (say) two vertices connected by an edge • Let Z = Σ d(j), where d(j) = degree of vertex j so far • Add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z • Rich get richer… • Evaluation of generative models • Can they explain all the characteristics of social networks? • Parameter tuning? • Other models for social network analysis • Copying model: leads to communities • Forest Fire Model • Electricity network (not a generative model, but interesting)
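The growth process above can be sketched as follows. Vertices are numbered from 0 here, and the new vertex draws targets with probability proportional to degree until it has k distinct neighbours; requiring distinct targets is an assumption of this sketch, since the slide does not say how repeated draws are handled.

```python
import random

def preferential_attachment(n, k=1, rng=None):
    """Grow a graph to n vertices; each new vertex links to k existing ones by degree."""
    rng = rng or random.Random()
    edges = [(0, 1)]                 # start with two connected vertices
    degree = {0: 1, 1: 1}
    for i in range(2, n):
        existing = list(degree)
        weights = [degree[j] for j in existing]   # proportional to d(j)/Z
        targets = set()
        while len(targets) < min(k, len(existing)):
            targets.add(rng.choices(existing, weights=weights)[0])
        degree[i] = 0
        for j in sorted(targets):
            edges.append((i, j))
            degree[i] += 1
            degree[j] += 1
    return edges, degree

edges, degree = preferential_attachment(50, k=2, rng=random.Random(1))
```

Early vertices tend to accumulate high degree, which is the "rich get richer" effect behind the heavy-tailed degree distribution.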
Mining Social Networks (VI) • Text Related Applications: quite a lot! • Ranking webpages • Multi-resolution Concept/Topic Map • Citation Impact of scientific literature • Entity-relation extraction • Bioinformatics: Pathway extraction • Reference Reconciliation • Web structure evolution • Community discovery in Weblogs
Text and Web mining • Data: text, unstructured/semi-structured; webpages with linkages, user logs; • E.g. webpages, news, email, weblogs, scientific literature, citations, customer reviews, forums, search logs, chat logs, legal documents, etc. • Challenges: • Modeling unstructured/semi-structured data • Coupling with Natural Language Processing • Handling high dimensionality • Handling data sparseness and ambiguity • The Web is too complicated!
Text and Web mining (II) • Selected Problems: • Text categorization/clustering (Already covered in NLP and ML) • Word sense disambiguation (Covered in NLP) • Information Extraction (Covered in NLP) • Dimension Reduction (Overlapping with ML and IR) • Collaborative Filtering, User-interest modeling • Topic Detection and Tracking • Comparative Text Mining, Theme based text mining • Transliteration mining • Email clustering / spam detection • Opinion mining (Overlapping with NLP) • Social Networks Related (Already covered) • Temporal Text Mining • Vision based page segmentation / Block based search
Text and Web mining (III) • Methods: Confluence of Multiple Disciplines • Database: data integration, schema matching, XML • Data mining: sequential pattern mining, association rule mining, … • IR: Search, language models, feedback, … • Machine Learning: SVD, Supervised/unsupervised learning, semi-supervised learning, Topic-models, … • NLP: POS tagging, parsing, context modeling, sentiment extraction, entity extraction, … • Statistical Learning: Bayesian methods, word bursting, time-series analysis, hypothesis testing, other statistical models, …
Text and Web mining (IV) • Resolution: • Word level: Word sense disambiguation, word bursting, transliteration mining • Entity level: information extraction, entity-relation network • Pattern level: opinion mining, relation extraction • Document level: document classification/clustering • Theme level: PLSI, LDA, comparative text mining, temporal text mining/spatiotemporal text mining • Topic level: topic detection and tracking, email threading • Web level: social network, weblog mining, block based search • Selected topics will be discussed in the next meeting.
Part 4: Research Groups • Introduction • Functionalities • Hot topics • Research Groups • Stanford, CMU, UIUC, Wisc, Helsinki, UMN • IBM, Microsoft, MSRA, Yahoo! • Others • Useful Resources
Research Groups • Rakesh Agrawal • One of the Leaders in Data Mining • Frequent patterns, Privacy-Preserving Data Mining • Stanford: Jerome H. Friedman • http://www-stat.stanford.edu/~jhf/ • Strong Statistical flavor, machine learning, boosting • CMU: Christos Faloutsos • http://www.cs.cmu.edu/~christos/ • Graph mining, Social Networks, Stream data mining, Image/Multimedia mining, time-series mining • UIUC: Jiawei Han • http://www-sal.cs.uiuc.edu/~hanj/ • Many! Frequent pattern mining, graph mining, OLAP/Cubing, Stream data mining, Classification, Clustering, …
Research Groups (II) • University of Helsinki: Heikki Mannila • http://www.cs.helsinki.fi/research/fdk/ • http://www.cs.helsinki.fi/u/mannila/ • Frequent itemset mining, computational biology • Wisconsin: Raghu Ramakrishnan • http://www.cs.wisc.edu/dmi/ • http://www.cs.wisc.edu/~raghu/ • Data warehousing, cubing, classification/clustering, • Minnesota: Vipin Kumar • http://www-users.cs.umn.edu/~kumar/ • Spatiotemporal data mining • IBM T.J Watson: Philip S. Yu • http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html • http://www.research.ibm.com/people/p/psyu/index.html • Frequent pattern mining, graph mining, data streams
Research Groups (III) • Microsoft Research Redmond: Surajit Chaudhuri • http://research.microsoft.com/dmx/ • Database related, Data cleaning, etc. • Microsoft Research Redmond: Eric Brill • http://research.microsoft.com/tmsn/ • http://research.microsoft.com/~brill/ • Text Mining, Search and Navigation Research, NLP • Microsoft Research Asia: • http://research.microsoft.com/wsm/ • Web search, web/text mining • Yahoo! Research: Prabhakar Raghavan • http://research.yahoo.com/researcher.shtml • http://theory.stanford.edu/~pragh/ • Web/Text Mining, Social Networks
Research Groups (IV) • IBM Webfountain • http://www.almaden.ibm.com/webfountain/ • UIC: Bing Liu • http://www.cs.uic.edu/~liub/ • Association rule mining, web/text mining • UNC: Wei Wang • http://www.cs.unc.edu/~weiwang/ • Biology data mining, frequent pattern mining • Simon Fraser: Jian Pei • http://www.cs.sfu.ca/~jpei/ • Sequential pattern mining, OLAP • National University of Singapore: Anthony K.H. Tung • http://www.comp.nus.edu.sg/~atung/ • Spatial data mining, Biology data mining • …
Part 5: Useful Resources • Introduction • Functionalities • Hot topics • Research Groups • Useful Resources • Text Books • Toolkits • Conferences • Others
Text Books • S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002 • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000 • T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003 • U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 • U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 • D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001 • T. M. Mitchell, Machine Learning, McGraw Hill, 1997 • G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991 • P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 • S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998 • I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005 - From Prof. Jiawei Han’s slides
Toolkits • Weka: Data mining software in Java • http://www.cs.waikato.ac.nz/%7Eml/weka/ • IlliMine (Illinois Data Mining System) • http://illimine.cs.uiuc.edu/ • Data Cubing • Frequent Pattern Mining • Sequential pattern mining • Graph pattern Mining • Classification • Collected by Vipin Kumar: • http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm
Conferences • KDD Conferences • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) • SIAM Data Mining Conf. (SDM) • (IEEE) Int. Conf. on Data Mining (ICDM) • Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD) • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) • Other related conferences • ACM SIGMOD • VLDB • (IEEE) ICDE • WWW, SIGIR • ICML, CVPR, NIPS • Journals • Data Mining and Knowledge Discovery (DAMI or DMKD) • IEEE Trans. on Knowledge and Data Eng. (TKDE) • KDD Explorations • ACM Trans. on KDD - From Prof. Jiawei Han’s slides