Theoretic Frameworks for Data Mining Reporter: Qi Liu
What should a framework for data mining do? • Encompass all or most typical data mining tasks • Have a probabilistic nature • Be able to talk about inductive generalizations • Deal with different types of data • Recognize the data mining process as an iterative and interactive process • Account for background knowledge in deciding what is an interesting discovery
Statistics Framework • Statistics viewpoint: • Volume of data • Computational feasibility • Database integration • Simplicity of use • Understandability of results
Machine Learning Framework • Data mining is applied machine learning • Machine learning focuses on prediction, based on known properties learned from the training data • Data mining focuses on the discovery of (previously) unknown properties of the data • Data mining often cannot use supervised methods because labeled training data is unavailable
Probabilistic Framework • To find the underlying joint distribution (e.g., a Bayesian network) of the variables in the data. • Advantages: • Solid theoretical background • Clustering/Classification fit easily into this framework • Drawback: • Cannot take the iterative and interactive nature of the data mining process into account
Data Compression Framework • Goal: to compress the data set by finding some structure in it and then encoding the data using fewer bits. • Minimum description length (MDL) principle • Instances: association rules, decision trees, clustering
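The compression idea can be illustrated with a two-part code: bits to state the model plus bits to encode the data given the model. The sketch below is only an illustration under stated assumptions; the function name `two_part_length`, the toy binary data, and the 3-bit cost for stating the biased parameter are all hypothetical and not from the slides.

```python
import math

def two_part_length(data, p_one, model_bits):
    """Total description length in bits: model cost + data cost under a Bernoulli model."""
    ones = sum(data)
    zeros = len(data) - ones
    # Shannon code length of the data given the model parameter p_one.
    data_bits = -(ones * math.log2(p_one) + zeros * math.log2(1 - p_one))
    return model_bits + data_bits

# Hypothetical data set: 16 binary observations, mostly ones.
data = [1] * 14 + [0] * 2

# Model A: fair coin, nothing to state about the model (assumed cost 0 bits).
fair = two_part_length(data, 0.5, model_bits=0)

# Model B: biased coin with p = 7/8, assumed to cost 3 bits to state.
biased = two_part_length(data, 0.875, model_bits=3)

# MDL prefers the model with the shorter total description.
best = "biased" if biased < fair else "fair"
print(fair, biased, best)
```

Here the structure found (a strong bias toward ones) pays for its own description, which is the sense in which a pattern "compresses" the data.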
Microeconomic Framework • To find actionable patterns that increase utility • Define the utility function from the customers' perspective
Inductive Database Framework • Store both data and patterns • An inductive database I(D,P) consists of a data component D and a pattern component P. • We assume that both the data and the pattern components D and P are sets of sets. This assumption is motivated by an analogy with traditional relational databases. • PS: compare deductive databases: partial rules
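A minimal sketch of the "sets of sets" idea, assuming transactions and itemset patterns as the concrete instance. The class `InductiveDatabase`, its methods, and the example items are hypothetical names chosen for illustration; the slides do not prescribe any implementation.

```python
from dataclasses import dataclass, field

@dataclass
class InductiveDatabase:
    # D: a set of transactions, each transaction a frozenset of items.
    data: set = field(default_factory=set)
    # P: a set of patterns, each pattern a frozenset of items (an itemset).
    patterns: set = field(default_factory=set)

    def support(self, pattern):
        """Number of transactions in D that contain every item of the pattern."""
        return sum(1 for row in self.data if pattern <= row)

    def select_patterns(self, min_support):
        """Query the pattern component against the data component."""
        return {p for p in self.patterns if self.support(p) >= min_support}

db = InductiveDatabase(
    data={
        frozenset({"bread", "milk"}),
        frozenset({"bread", "butter"}),
        frozenset({"milk"}),
    },
    patterns={frozenset({"bread"}), frozenset({"bread", "milk"})},
)
frequent = db.select_patterns(min_support=2)
print(frequent)
```

Because D is a set, identical transactions collapse into one; a multiset would be needed if duplicates matter.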
Information Theoretic Framework • Data mining is a process of information transmission from an algorithm to the data miner. • Model the data miner’s state of mind as a probability distribution, called the background distribution, which represents the data miner’s uncertainty and misconceptions. • In the data mining process, properties of the data (referred to as patterns) are revealed.
Attention! • Focus on the data miner as much as on the data. An interesting pattern should be defined subjectively, rather than objectively. • The primary concern is understanding the data itself, rather than the stochastic source that generated it.
Bird’s eye view of the IT framework • A data miner is able to formalize her beliefs in a background distribution, denoted P* • Kraft’s inequality is an equality • Code length of x under a distribution P: -log(P(x)) • The entropy of P* could be small due to the data miner being overly confident • Update P* to a new background distribution P*’ • Measure the reduction in code length: information gain
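The code length and information gain can be sketched numerically. This is a toy illustration, assuming a three-outcome data space and base-2 logarithms; the distributions `p_before` and `p_after` are hypothetical stand-ins for P* and P*’.

```python
import math

def code_length(p, x):
    """Code length in bits of outcome x under distribution p: -log2 p(x)."""
    return -math.log2(p[x])

# Hypothetical background distribution P* and its update P*'
# after the data miner learns that the data is not "a".
p_before = {"a": 0.5, "b": 0.25, "c": 0.25}
p_after = {"a": 0.0, "b": 0.5, "c": 0.5}

x = "b"  # the outcome actually present in the data

# Information gain: reduction in the code length of the data.
information_gain = code_length(p_before, x) - code_length(p_after, x)
print(information_gain)
```

With these numbers the code length of "b" drops from 2 bits to 1 bit, so the update is worth 1 bit to the data miner.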
Trade-off • Good data mining algorithms are those that are able to pinpoint the patterns that lead to a large information gain. • It is the trade-off between the information gain from revealing a pattern in the data and the description length of that pattern that should define the pattern’s interest to the data miner.
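One possible way to express this trade-off is a ratio of information gain to description length. The slide does not fix a formula, so the ratio, the function name `interest`, and the example numbers below are assumptions made only for illustration.

```python
def interest(information_gain_bits, description_length_bits):
    """Assumed interest measure: bits gained per bit needed to describe the pattern."""
    if description_length_bits <= 0:
        raise ValueError("description length must be positive")
    return information_gain_bits / description_length_bits

# Hypothetical patterns: (information gain in bits, description length in bits).
patterns = {
    "short_pattern": (3.0, 2.0),
    "long_pattern": (4.0, 8.0),
}

# Rank patterns from most to least interesting under the assumed measure.
ranked = sorted(patterns, key=lambda name: interest(*patterns[name]), reverse=True)
print(ranked)
```

Under this measure the long pattern loses despite its larger information gain, because it costs the data miner four times as many bits to take in.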
How to determine P* and P*’? • Given a set of probability distributions, choose the distribution P of maximum entropy from that set • P is a good surrogate for P*
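The formula on this slide did not survive, so the following is only a sketch of how a maximum entropy choice is commonly written. It assumes a discrete data space and uses \mathcal{P}, a symbol introduced here, for the given set of candidate distributions.

```latex
P = \arg\max_{Q \in \mathcal{P}} \; -\sum_{x} Q(x) \log Q(x)
```

Read this way, P commits to nothing beyond what membership in the set requires, which is a plausible reason for treating it as a surrogate for P*.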
Patterns • Formalize a pattern as a constraint x ∈ X’ for some subset X’ of the data space • For such a pattern: • P*’(x) = 0 for all x ∉ X’ • Update P to be P’ (called the updated surrogate background distribution) • Measure the self-information (w.r.t. P’) of the pattern
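The update step can be sketched as conditioning. This assumes that a pattern restricts the data to a subset X’, that P’ is P renormalized on X’, and that the self-information is taken as -log2 of the mass P assigns to X’. The slide's own formulas are not preserved, so all three are assumptions, as is the function name `condition_on_pattern`.

```python
import math

def condition_on_pattern(p, allowed):
    """Return (updated distribution, self-information in bits) for the pattern 'x in allowed'."""
    mass = sum(prob for x, prob in p.items() if x in allowed)
    if mass == 0:
        raise ValueError("pattern has zero probability under the surrogate distribution")
    # Outcomes that violate the pattern get probability 0; the rest are renormalized.
    updated = {x: (prob / mass if x in allowed else 0.0) for x, prob in p.items()}
    # Assumed self-information: surprise of seeing the pattern hold.
    return updated, -math.log2(mass)

# Hypothetical surrogate background distribution P over three outcomes.
p = {"a": 0.5, "b": 0.25, "c": 0.25}

# Pattern: the data lies in X' = {"b", "c"}.
p_updated, self_information = condition_on_pattern(p, {"b", "c"})
print(p_updated, self_information)
```

A pattern that the data miner already expected (mass near 1) carries almost no information, while one that rules out most of the probability mass carries a lot.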
More issues about the framework • The cost of a pattern should be specified in advance by the data miner. • Joint patterns • Cases: • Clustering and alternative clustering • Dimensionality reduction (PCA) • Frequent pattern mining • Community detection • Subgroup discovery and supervised learning