Theoretic Frameworks for Data Mining

Theoretic Frameworks for Data Mining. Reporter: Qi Liu. What should a framework for data mining provide? Encompass all or most typical data mining tasks; have a probabilistic nature; be able to talk about inductive generalizations; deal with different types of data.

Presentation Transcript


  1. Theoretic Frameworks for Data Mining Reporter: Qi Liu

  2. What should a framework for data mining provide? • Encompass all or most typical data mining tasks • Have a probabilistic nature • Be able to talk about inductive generalizations • Deal with different types of data • Recognize the data mining process as an iterative and interactive process • Account for background knowledge in deciding what is an interesting discovery

  3. Statistics Framework • Statistics viewpoint: • Volume of data • Computational feasibility • Database integration • Simplicity of use • Understandability of results

  4. Machine Learning Framework • Data mining is applied machine learning • Machine learning focuses on prediction, based on known properties learned from the training data • Data mining focuses on the discovery of (previously) unknown properties of the data • Data mining often cannot use supervised methods because labeled training data is unavailable

  5. Probabilistic Framework • To find the underlying joint distribution (e.g., a Bayesian network) of the variables in the data. • Advantages: • Solid theoretical background • Clustering and classification fit easily into this framework • Drawback: • Cannot take the iterative and interactive nature of the data mining process into account

  6. Data Compression Framework • Goal: to compress the data set by finding some structure in it and then encoding the data using fewer bits. • Minimum description length (MDL) principle • Instances: association rules, decision trees, clustering
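
To make the compression goal concrete, here is a minimal sketch of a two-part description length for a binary sequence. It is an illustration only: the function name `two_part_code_length`, the Bernoulli model, and the parameter cost of 0.5 * log2(n) bits (a common approximation) are assumptions, not something stated on the slide.

```python
import math

def two_part_code_length(bits):
    """Approximate two-part code length in bits: L(model) + L(data | model).

    Assumed model: i.i.d. Bernoulli with the maximum likelihood parameter.
    Assumed model cost: 0.5 * log2(n) bits for the single parameter.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one observation")
    ones = sum(bits)
    theta = ones / n
    data_cost = 0.0
    # Cost of the data given the model: -log2 of its likelihood.
    for count, prob in ((ones, theta), (n - ones, 1.0 - theta)):
        if count:
            data_cost -= count * math.log2(prob)
    model_cost = 0.5 * math.log2(n)
    return model_cost + data_cost

skewed = [1] * 90 + [0] * 10      # strong structure: mostly ones
balanced = [1, 0] * 50            # no structure this model can exploit
raw_cost = len(skewed)            # 1 bit per symbol without any model

skewed_cost = two_part_code_length(skewed)
balanced_cost = two_part_code_length(balanced)
```

For the skewed sequence the structured encoding is shorter than the raw 100 bits, while for the balanced sequence the model cost makes it slightly longer, so under these assumptions the model is only worth keeping in the first case.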

  7. Microeconomic Framework • To find actionable patterns that increase utility • Define the utility function from the perspective of customers

  8. Inductive Database Framework • Store both data and patterns • An inductive database I(D,P) consists of a data component D and a pattern component P. • We assume that both the data and the pattern components D and P are sets of sets. This assumption is motivated by an analogy with traditional relational databases. • PS: deductive database: partial rules

  9. Information Theoretic Framework • Data mining is a process of information transmission from an algorithm to the data miner. • Model the data miner’s state of mind as a probability distribution, called the background distribution, which represents the data miner’s uncertainty and misconceptions. • In the data mining process, properties of the data (referred to as patterns) are revealed.

  10. Attention! • Focus on the data miner as much as on the data. An interesting pattern should be defined subjectively, rather than objectively. • The primary concern is understanding the data itself, rather than the stochastic source that generated it.

  11. Bird’s-eye view of the IT framework • A data miner is able to formalize her beliefs in a background distribution, denoted P* • Kraft’s inequality is assumed to hold with equality • Code length of x under a distribution P: -log(P(x)) • The entropy of P* could be small due to the data miner being overly confident • Update P* to a new background distribution P*’ • Measure the reduction in code length: the information gain
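
The code-length view above can be sketched in a few lines of Python. The two distributions below are hypothetical toy examples (not from the presentation), and logarithms are taken base 2 so that lengths are in bits.

```python
import math

def code_length(p, x):
    """Code length in bits of outcome x under distribution p: -log2 p(x)."""
    return -math.log2(p[x])

def entropy(p):
    """Expected code length in bits under p itself."""
    return sum(-q * math.log2(q) for q in p.values() if q > 0)

# Hypothetical background distribution over four possible data sets.
background = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
# Hypothetical updated distribution after a pattern rules out "a" and "b".
updated = {"c": 0.5, "d": 0.5}

observed = "c"
# Information gain: reduction in the code length of the observed data.
gain = code_length(background, observed) - code_length(updated, observed)
```

Here the observed data costs 3 bits under the background distribution and 1 bit after the update, so the gain is 2 bits.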

  12. Trade-off • Good data mining algorithms are those that are able to pinpoint the patterns that lead to a large information gain. • It is the trade-off between the information gain due to the revealing of a pattern in the data and the description length of the pattern that should define a pattern’s interest to the data miner.

  13. How to determine P* and P*’? • Given a set of probability distributions, take the distribution of maximum entropy. • Given P , i.e. • P is a good surrogate for P*
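
As an illustration of the maximum entropy idea, the following sketch finds the maximum entropy distribution on a finite set of values subject to a single constraint on the mean. The kind of constraint, the function name `max_entropy_with_mean`, and the bisection solver are assumptions chosen for the example; the slide does not say how the set of distributions is specified.

```python
import math

def max_entropy_with_mean(values, target_mean, iterations=200):
    """Maximum entropy distribution on `values` with a prescribed mean.

    Under a mean constraint the solution has the form p(v) ∝ exp(lam * v);
    lam is found by bisection because the mean increases with lam.
    """
    if not min(values) < target_mean < max(values):
        raise ValueError("target_mean must lie strictly inside the value range")

    def dist(lam):
        shift = max(lam * v for v in values)  # avoids overflow in exp
        weights = [math.exp(lam * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(dist(lam), values))

    lo, hi = -1.0, 1.0
    while mean(lo) > target_mean:   # widen the bracket downwards
        lo *= 2
    while mean(hi) < target_mean:   # widen the bracket upwards
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

faces = [1, 2, 3, 4, 5, 6]
fair = max_entropy_with_mean(faces, 3.5)    # no preference beyond the mean
loaded = max_entropy_with_mean(faces, 4.5)  # belief: the mean is 4.5
```

With a mean of 3.5 the result is the uniform distribution; with a mean of 4.5 the probabilities increase towards the larger faces while staying as spread out as the constraint allows.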

  14. Patterns • Formalize a pattern as a constraint for some X’ • For the pattern above: • P*(x) = 0 for all x’ • Update P to be P’ (called the updated surrogate background distribution): • Self-information (w.r.t. P’) of the pattern is
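
The update formulas are not preserved in this transcript, so the following is only a sketch under stated assumptions: the pattern is read as "the data lies in a set X’", the update is taken to be conditioning of P on X’, and the information revealed is measured as -log2 of the probability mass that P assigned to X’. The function name `condition_on` and the toy distribution are hypothetical.

```python
import math

def condition_on(p, allowed):
    """Condition distribution p (a dict) on the event `allowed`.

    Returns the updated distribution and the self-information, in bits,
    of learning that the outcome lies in `allowed`.
    Assumption: the update rule is plain conditioning.
    """
    mass = sum(q for x, q in p.items() if x in allowed)
    if mass <= 0:
        raise ValueError("pattern has zero probability under p")
    updated = {x: (q / mass if x in allowed else 0.0) for x, q in p.items()}
    return updated, -math.log2(mass)

# Hypothetical surrogate background distribution: uniform over 8 outcomes.
background = {i: 1 / 8 for i in range(8)}
pattern = {0, 1}  # hypothetical pattern: the data is outcome 0 or 1

updated, info_bits = condition_on(background, pattern)
```

In this toy case the pattern leaves 2 of 8 equally likely outcomes, so it reveals 2 bits and the updated distribution puts probability 0.5 on each remaining outcome.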

  15. More issues about the framework • The cost of a pattern should be specified in advance by the data miner. • Joint patterns • Cases: • Clustering and alternative clustering • Dimensionality reduction (PCA) • Frequent pattern mining • Community detection • Subgroup discovery and supervised learning
