Data Mining I
Jagdish Gangolly
State University of New York at Albany
Acc 522 Fall 2001 (Jagdish S. Gangolly)
Data Mining
• What is data mining?
• Data mining primitives
• Task-relevant data
• Kinds of knowledge to be mined
• Background knowledge
• Interestingness measures
• Visualisation of discovered patterns
• Query language
Data Mining
• Concept description (descriptive data mining)
• Data generalisation
  • Data cube (OLAP) approach (offline pre-computation)
  • Attribute-oriented induction approach (online aggregation)
  • Presentation of generalisation
• Descriptive statistical measures and displays
What is Data Mining?
• Discovery of knowledge from databases
• A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, which measures to evaluate, how the knowledge is to be visualised)
• A query language that lets the user interactively visualise the knowledge mined
Data mining primitives I
• Task-relevant data: attributes relevant for the study of the problem at hand
• Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution, …
• Background knowledge: knowledge about the domain of the problem (concept hierarchies, beliefs about the relationships, expected patterns of data, …)
Data mining primitives II
• Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
• Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes, …
Task-relevant Data
Steps:
• Derivation of the initial relation through database queries (data retrieval operations), i.e. obtaining a minable view
• Data cleaning & transformation of the initial relation to facilitate mining
• Data mining
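The three steps above can be sketched end to end on a toy database. This is only an illustration under stated assumptions: the `sales` table, its columns, the income threshold, and the `income_band` helper are all hypothetical, and the final counting step is a trivial stand-in for a real mining algorithm.

```python
import sqlite3

# Hypothetical toy table; schema and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, age INTEGER, income REAL, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("ann", 34, 52000.0, "laptop"),
        ("bob", None, 41000.0, "printer"),  # missing age
        ("cat", 45, 78000.0, " Laptop "),   # untidy item label
        ("dan", 29, 39000.0, "phone"),
    ],
)

# Step 1: derive the initial relation (the minable view) with a query.
initial_relation = conn.execute(
    "SELECT age, income, item FROM sales WHERE income > 40000"
).fetchall()


def income_band(income):
    """Hypothetical transformation: bin income into coarse bands."""
    return "high" if income >= 60000 else "medium"


# Step 2: clean and transform the initial relation: drop tuples with a
# missing age, normalise item labels, and replace income by its band.
cleaned = [
    (age, income_band(income), item.strip().lower())
    for age, income, item in initial_relation
    if age is not None
]

# Step 3: a trivial stand-in for mining: count items per income band.
counts = {}
for _, band, item in cleaned:
    counts[(band, item)] = counts.get((band, item), 0) + 1

print(counts)
```

Keeping retrieval, cleaning, and mining as separate stages means the minable view can be inspected before any pattern search runs on it.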
Kinds of knowledge to be mined
• Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries)
• Association
  An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
• Classification
• Discrimination
• Clustering
• Evolution analysis
Background knowledge
• Knowledge from the problem domain, usually in the form of
  • concept hierarchies (rolling up or drilling down)
  • schema hierarchies (lattices)
  • set-grouping hierarchies (successive sub-grouping of attributes)
  • rule-based hierarchies
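A concept hierarchy can be represented as a child-to-parent mapping, with rolling up and drilling down as walks along it. The sketch below assumes a hypothetical location hierarchy; the place names and the `roll_up` and `drill_down` helpers are illustrative, not part of the course material.

```python
# Hypothetical location hierarchy: each value maps to its parent concept.
PARENT = {
    "Albany": "New York",
    "Buffalo": "New York",
    "Boston": "Massachusetts",
    "New York": "USA",
    "Massachusetts": "USA",
    "USA": "all",
}


def roll_up(value, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value


def drill_down(value):
    """Return the immediate children of a concept, sorted by name."""
    return sorted(child for child, parent in PARENT.items() if parent == value)


print(roll_up("Albany"))       # one level up: city to state
print(roll_up("Albany", 2))    # two levels up: city to country
print(drill_down("New York"))  # one level down: state to cities
```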
Interestingness measures I
• Simplicity: the more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …)
• Certainty: validity, trustworthiness
  confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
  Sometimes called the “certainty factor”
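The confidence ratio can be computed directly from a list of tuples. This is a minimal sketch, assuming each tuple is modelled as a set of items; the `confidence` function and the shopping-basket data are hypothetical.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of the tuples containing A that also contain B."""
    a, b = set(antecedent), set(consequent)
    with_a = [t for t in transactions if a <= set(t)]
    if not with_a:
        return 0.0  # rule never applies; report zero rather than divide by zero
    with_both = [t for t in with_a if b <= set(t)]
    return len(with_both) / len(with_a)


# Hypothetical baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Three baskets contain bread, and two of those also contain milk.
print(confidence(baskets, {"bread"}, {"milk"}))
```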
Interestingness measures II
• Utility: measured by support, the percentage of task-relevant data tuples for which the pattern is true
  support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)
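Support differs from confidence only in its denominator: it divides by all tuples rather than by those containing A. A minimal sketch, again assuming tuples are modelled as sets of items; the `support` function and the data are hypothetical.

```python
def support(transactions, items):
    """Fraction of all tuples that contain every item in `items`."""
    wanted = set(items)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


# Hypothetical baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# For the rule bread => milk, count baskets holding both items.
print(support(baskets, {"bread", "milk"}))
```

A rule can have high confidence yet low support if A itself is rare, which is why the two measures are usually reported together.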
Visualisation of discovered patterns
• Hierarchies
• Tables
• Pie/bar charts
• Dot/box plots
• …
Descriptive Data Mining (Concept Description & Characterisation)
• Concept description: description of data generalised at multiple levels of abstraction
• Concept characterisation: concise and succinct summarisation of a given collection of data
• Concept comparison: discrimination
Data Generalisation
• Abstraction of task-relevant high conceptual level data from a database containing relatively low conceptual level data
• Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47)
• Attribute-oriented induction approach (online aggregation)
• Presentation of generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pages 192 & 193)
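Attribute-oriented induction can be sketched as two operations followed by a merge: remove attributes that cannot be generalised, replace the remaining values by higher-level concepts, then merge identical generalised tuples while accumulating a count. The hierarchies, the age threshold, and the student tuples below are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy for the city attribute.
CITY_TO_COUNTRY = {"Albany": "USA", "Boston": "USA", "Toronto": "Canada"}


def age_band(age):
    """Hypothetical generalisation of a numeric age into a coarse band."""
    return "under 30" if age < 30 else "30 and over"


def generalise(tuples):
    """Generalise (name, city, age) tuples and merge duplicates with counts."""
    counts = Counter()
    for _name, city, age in tuples:
        # Attribute removal: name has many distinct values and no hierarchy.
        # Attribute generalisation: city -> country, age -> age band.
        counts[(CITY_TO_COUNTRY[city], age_band(age))] += 1
    return counts


students = [
    ("ann", "Albany", 22),
    ("bob", "Boston", 24),
    ("cy", "Toronto", 23),
    ("di", "Albany", 35),
]

for row, count in sorted(generalise(students).items()):
    print(row, count)
```

Unlike the data cube approach, nothing here is pre-computed: the aggregation runs online, at query time, over the task-relevant tuples.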
Descriptive Statistical Measures and Displays I
• Measures of central tendency
  • Mean, weighted mean (weights signifying importance or occurrence frequency)
  • Median
  • Mode
• Measures of dispersion
  • Quartiles, outliers, boxplots
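These measures are all available from Python's standard `statistics` module. The sketch below uses invented data and flags outliers with the common 1.5 × IQR rule, which is an assumption here since the slide does not state a threshold. Note that `statistics.quantiles` uses its default "exclusive" method, so other tools may report slightly different quartiles.

```python
import statistics


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def iqr_outliers(data, k=1.5):
    """Values lying more than k * IQR outside the first or third quartile."""
    q1, _q2, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical measurements, invented for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("modes:", statistics.multimode(data))
print("quartiles:", statistics.quantiles(data, n=4))
print("outliers:", iqr_outliers(data))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

`multimode` is used rather than `mode` because this data set has two equally frequent values.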
Descriptive Statistical Measures and Displays II
• Displays
  • Histograms (Fig 5.6, page 214)
  • Bar charts
  • Quantile plot (Fig 5.7, page 215)
  • Quantile-quantile plot (Fig 5.8, page 216)
  • Scatter plot (Fig 5.9, page 216)
  • Loess curve (Fig 5.10, page 217)
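The coordinates behind a quantile plot and a quantile-quantile plot can be computed without any plotting library. This sketch assumes the common convention of pairing the i-th smallest of n values with the fraction (i − 0.5)/n, and it handles only the simple q-q case of two samples of equal size; both function names and the data are hypothetical.

```python
def quantile_plot_points(data):
    """Pair each sorted value with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


def qq_points(sample_a, sample_b):
    """Pair corresponding order statistics of two equal-sized samples."""
    if len(sample_a) != len(sample_b):
        # Unequal sizes would need interpolated quantiles; not covered here.
        raise ValueError("this sketch needs samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical samples, invented for illustration.
branch_1 = [40, 10, 30, 20]
branch_2 = [25, 45, 15, 35]

print(quantile_plot_points(branch_1))
print(qq_points(branch_1, branch_2))
```

Points of the q-q output that sit above the line y = x indicate that the second sample is larger at that quantile.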