Theoretic Frameworks for Data Mining

Reporter: Qi Liu

What should a framework for data mining provide? • Encompass all or most typical data mining tasks • Have a probabilistic nature • Be able to talk about inductive generalizations • Deal with different types of data • Recognize the data mining process as an iterative and interactive process • Account for background knowledge in deciding what is an interesting discovery

Statistics Framework • Statistics viewpoint: • Volume of data • Computational feasibility • Database integration • Simplicity of use • Understandability of results

Machine Learning Framework • Data mining is applied machine learning • Machine learning focuses on prediction, based on known properties learned from the training data • Data mining focuses on the discovery of (previously) unknown properties in the data • Data mining often cannot use supervised methods because labeled training data is unavailable

Probabilistic Framework • To find the underlying joint distribution (e.g., a Bayesian network) of the variables in the data. • Advantages: • Solid theoretical background • Clustering and classification fit easily into this framework • Drawback: • Cannot take the iterative and interactive nature of the data mining process into account
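As a rough illustration of how classification fits into the probabilistic framework, the sketch below estimates a joint distribution from data and then predicts the most probable class for a feature value. The toy records, the relative-frequency estimate, and the function names `empirical_joint` and `classify` are assumptions made for this example; the slides do not prescribe any particular estimator.

```python
from collections import Counter


def empirical_joint(records):
    """Estimate the joint distribution of the variables as relative frequencies."""
    counts = Counter(records)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


def classify(joint, feature):
    """Return the class c that maximizes P(feature, c) under the estimated joint."""
    candidates = {c: p for (f, c), p in joint.items() if f == feature}
    return max(candidates, key=candidates.get)


# Hypothetical toy records of the form (weather, plays_tennis).
records = [
    ("sunny", "yes"), ("sunny", "yes"), ("sunny", "no"),
    ("rainy", "no"), ("rainy", "no"), ("rainy", "yes"),
]

joint = empirical_joint(records)
print(classify(joint, "sunny"))
print(classify(joint, "rainy"))
```

Maximizing the joint probability over the class is equivalent to maximizing the conditional probability of the class given the feature, because the feature's marginal probability is the same for every candidate class.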

Data Compression Framework • Goal: to compress the data set by finding some structure in it and then encoding the data using fewer bits. • Minimum description length (MDL) principle • Instances: association rules, decision trees, clusterings
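The compression view can be sketched with a two-part code: the total description length is the cost of stating a model plus the cost of the data encoded with that model. The example below is only an illustration under assumed choices (a Bernoulli model for a binary sequence, with the model stated as the count of ones); the function names are hypothetical.

```python
import math


def uniform_code_length(bits):
    """Bits needed when every symbol costs exactly 1 bit (no structure assumed)."""
    return float(len(bits))


def two_part_code_length(bits):
    """Two-part description length: model cost plus data cost given the model.

    Model: the number of ones, an integer in 0..n, costing log2(n + 1) bits.
    Data: -log2 of the sequence's probability under the fitted Bernoulli model.
    """
    n = len(bits)
    ones = sum(bits)
    model_cost = math.log2(n + 1)
    data_cost = 0.0
    for count in (ones, n - ones):
        if count > 0:
            data_cost -= count * math.log2(count / n)
    return model_cost + data_cost


skewed = [1] * 90 + [0] * 10   # strong structure: mostly ones
balanced = [1, 0] * 50         # no frequency structure to exploit

print(two_part_code_length(skewed), uniform_code_length(skewed))
print(two_part_code_length(balanced), uniform_code_length(balanced))
```

For the skewed sequence the model pays for itself and the total is shorter than the uniform code; for the balanced sequence the model cost is wasted, which is the MDL argument for preferring structure only when it compresses.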

Microeconomic Framework • To find actionable patterns that increase utility • Define the utility function from the perspective of customers

Inductive Database Framework • Store both data and patterns • An inductive database I(D,P) consists of a data component D and a pattern component P. • We assume that both the data and the pattern components D and P are sets of sets. This assumption is motivated by an analogy with traditional relational databases. • PS: deductive database: partial rules
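One way to picture I(D,P) is as a single object that holds both components and lets queries range over patterns as well as data. The sketch below assumes D is a set of transactions and P a set of itemsets, each represented as a `frozenset`; the class name `InductiveDatabase` and its methods are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class InductiveDatabase:
    """A toy I(D, P): both components are sets of sets."""
    data: set = field(default_factory=set)      # D: here, a set of transactions
    patterns: set = field(default_factory=set)  # P: here, a set of itemsets

    def support(self, pattern):
        """Number of data sets that contain every item of the pattern."""
        return sum(1 for row in self.data if pattern <= row)

    def select_patterns(self, min_support):
        """Query the pattern component: keep patterns with enough support in D."""
        return {p for p in self.patterns if self.support(p) >= min_support}


db = InductiveDatabase(
    data={
        frozenset({"bread", "milk"}),
        frozenset({"bread", "butter"}),
        frozenset({"bread", "milk", "butter"}),
    },
    patterns={
        frozenset({"bread"}),
        frozenset({"milk"}),
        frozenset({"milk", "butter"}),
    },
)

print(db.select_patterns(min_support=2))
```

Because D is modeled as a set, identical transactions collapse into one; a multiset would be needed if duplicate rows matter.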

Information Theoretic Framework • Data mining is a process of information transmission from an algorithm to the data miner. • Model the data miner’s state of mind as a probability distribution, called the background distribution, which represents the data miner’s uncertainty and misconceptions. • In the data mining process, properties of the data (referred to as patterns) are revealed.

Attention! • Focus on the data miner as much as on the data. An interesting pattern should be defined subjectively, rather than objectively. • The primary concern is understanding the data itself, rather than the stochastic source that generated it.

Bird’s-eye view of the information theoretic framework • A data miner is able to formalize her beliefs in a background distribution, denoted P* • Kraft’s inequality is assumed to hold with equality • Code length of x under a distribution P: -log(P(x)) • The entropy of P* could be small because the data miner is overly confident • Update P* to a new background distribution P*’ • Measure the reduction in code length: the information gain
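The code-length view can be made concrete with a small numeric sketch. The two distributions below are invented for illustration, and base-2 logarithms (bits) are an assumption, since the slides write log without a base. The information gain for the observed data x is the drop in code length when the background distribution is replaced by the updated one.

```python
import math


def code_length(p, x):
    """Code length of x, in bits, under distribution p: -log2(p(x))."""
    return -math.log2(p[x])


# Hypothetical background distribution of an overly confident data miner ...
background = {"a": 0.7, "b": 0.2, "c": 0.1}
# ... and a hypothetical updated distribution after a pattern was revealed.
updated = {"a": 0.1, "b": 0.2, "c": 0.7}

x = "c"  # the data actually observed
gain = code_length(background, x) - code_length(updated, x)
print(gain)
```

A positive gain means the revealed pattern made the observed data cheaper to describe, that is, less surprising to the data miner.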

Trade-off • Good data mining algorithms are those that are able to pinpoint the patterns that lead to a large information gain. • The trade-off between the information gain from revealing a pattern in the data and the description length of that pattern should define the pattern’s interest to the data miner.

How to determine P* and P*’? • Given a set of probability distributions, take the distribution of maximum entropy. • Given P, i.e. • P is a good surrogate for P*
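To illustrate what choosing a maximum entropy distribution from a set can look like, the sketch below assumes the set contains all distributions on a finite domain with a prescribed expected value. That constraint is an assumption made for this example, not something the slides specify. Under it, the maximum entropy distribution has the exponential form P(x) ∝ exp(λx), and λ can be found by bisection because the mean increases monotonically with λ. The function name `maxent_with_mean` and its parameters are hypothetical.

```python
import math


def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy distribution on `values` with the given expected value."""

    def distribution(lam):
        # Subtract the largest exponent before exponentiating, for stability.
        shift = max(lam * v for v in values)
        weights = [math.exp(lam * v - shift) for v in values]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(v * p for v, p in zip(values, distribution(lam)))

    # Bisection on lambda: the mean is increasing in lambda.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return distribution((lo + hi) / 2)


faces = [1, 2, 3, 4, 5, 6]
uniform = maxent_with_mean(faces, 3.5)  # no bias needed: uniform distribution
skewed = maxent_with_mean(faces, 4.5)   # believed mean is higher than 3.5
print(uniform)
print(skewed)
```

With a believed mean of 3.5 the result is the uniform distribution, which reflects the idea that the surrogate should commit to nothing beyond the stated beliefs.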

Patterns • Formalize a pattern as a constraint for some X’ • For a pattern above: • P*(x) = 0 for all x’ • Update P to be P’ (called the updated surrogate background distribution): • Self-information (w.r.t. P’) of the pattern is
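The formulas for this slide did not survive in the transcript, so the following is only a sketch under explicit assumptions: a pattern is taken to state that the data x lies in a subset X’ of the domain, the update is taken to be conditioning P on X’, and the self-information is taken to be -log2 of the probability that P assigns to X’ before the update. The function names are hypothetical.

```python
import math


def condition_on(p, allowed):
    """Assumed update rule: zero out x outside `allowed` and renormalize."""
    mass = sum(prob for x, prob in p.items() if x in allowed)
    if mass == 0:
        raise ValueError("pattern has zero probability under p")
    return {x: (prob / mass if x in allowed else 0.0) for x, prob in p.items()}


def self_information(p, allowed):
    """Assumed definition: -log2 of the probability of the pattern under p, in bits."""
    return -math.log2(sum(prob for x, prob in p.items() if x in allowed))


# Hypothetical surrogate background distribution P over four possible data sets.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
pattern = {"c", "d"}  # the revealed pattern: the data is one of these

P_updated = condition_on(P, pattern)
print(P_updated)
print(self_information(P, pattern))
```

Under these assumptions, a pattern that the data miner considered unlikely carries more bits, which matches the idea that surprising patterns are the informative ones.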

More issues about the framework • The cost of a pattern should be specified in advance by the data miner. • Joint patterns • Cases: • Clustering and alternative clustering • Dimensionality reduction (PCA) • Frequent pattern mining • Community detection • Subgroup discovery and supervised learning
