
Prediction Cubes


Presentation Transcript


  1. Prediction Cubes Bee-Chung Chen, Lei Chen, Yi Lin and Raghu Ramakrishnan University of Wisconsin - Madison

  2. Subset Mining
  • We want to find interesting subsets of the dataset
  • Interestingness: defined by the “model” built on a subset
  • Cube space: a combination of dimension attribute values defines a candidate subset (just like regular OLAP)
  • We want the measures to represent decision/prediction behavior
  • Summarize a subset using the “model” built on it
  • A big change from regular OLAP!

  3. The Idea
  • Build OLAP data cubes in which cell values represent decision/prediction behavior
  • In effect, build a tree for each cell/region in the cube (a minimal sketch follows below); observe that this is not the same as a collection of trees used in an ensemble method!
  • The idea is simple, but it leads to promising data mining tools
  • Ultimate objective: exploratory analysis of the entire space of “data mining choices”
    • Choice of algorithms, data conditioning parameters, …
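A minimal sketch of the one-model-per-cell idea, assuming a pandas DataFrame named `loans` with the running example's columns (Location, Time, Race, Sex, Approval). All names are illustrative, not from the paper, and the per-cell learner could be any classifier, not necessarily a decision tree.

```python
# Hypothetical sketch: train one classifier per cube cell (here, per
# (Location, Time) pair). Column names mirror the slides' running example.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def build_cell_models(loans: pd.DataFrame,
                      dims=("Location", "Time"),
                      predictors=("Race", "Sex"),
                      target="Approval"):
    """Return {cell -> decision tree trained only on that cell's subset}."""
    models = {}
    for cell, subset in loans.groupby(list(dims)):
        X = pd.get_dummies(subset[list(predictors)])  # crude categorical encoding
        y = subset[target]
        models[cell] = DecisionTreeClassifier().fit(X, y)
    return models
```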

  4. Example (1/7): Regular OLAP
  • Goal: look for patterns of unusually high numbers of applications
  • Z (dimensions): Location, Time; Y (measure): number of applications
  [Slide figure: a fact table with Location and Time as dimensions and the application count as the measure]

  5. Example (2/7): Regular OLAP
  • Goal: look for patterns of unusually high numbers of applications
  • Z: dimensions; Y: measure
  • Cell value: number of loan applications
  [Slide figure: a data cube of loan-application counts by Location and Time; rolling up gives coarser regions (e.g., country by year), drilling down gives finer regions (e.g., state/province by month)]

  6. Example (3/7): Decision Analysis
  • Goal: analyze a bank’s loan decision process w.r.t. two dimensions: Location and Time
  • Fact table D: columns Location, Time (Z: dimensions), Race, Sex, … (X: predictors), Approval (Y: class)
  • Example rows: (AL, USA; Dec, 04; White; M; …; Yes) … (WY, USA; Dec, 04; Black; F; …; No)
  • A model h(X; Z(D)), e.g., a decision tree, is built on each cube subset

  7. Example (3/7): Decision Analysis
  • Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
  • Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?” (a small sketch of such a check follows below)
  • Are there branches and times with decision making reminiscent of 1950s Alabama?
  • This requires comparing classifiers trained on different subsets of the data.
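One way to make the question above concrete, as a hedged sketch (not the paper's measure): toggle only the sensitive attribute in the test rows and see how often the cell's classifier changes its prediction. The attribute values and the assumption that `model.predict` accepts the raw predictor DataFrame (e.g., a pipeline that does its own encoding) are illustrative.

```python
# Hypothetical check: how often does a cell's classifier flip its prediction
# when only the sensitive attribute (Race) is changed?
import numpy as np

def race_sensitivity(model, test_X, sensitive_col="Race",
                     values=("White", "Black")):
    """Fraction of test rows whose predicted label differs between the two
    counterfactual settings of the sensitive attribute."""
    a = test_X.copy(); a[sensitive_col] = values[0]
    b = test_X.copy(); b[sensitive_col] = values[1]
    return float(np.mean(model.predict(a) != model.predict(b)))
```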

  8. Example (4/7): Prediction Cubes
  • Build a model using the data from the USA in Dec., 2004, i.e., the cube subset [USA, Dec 04](D); e.g., a decision tree h(X; [USA, Dec 04](D))
  • Evaluate that model
  • Measure in a cell:
    • Accuracy of the model
    • Predictiveness of Race, measured based on that model
    • Similarity between that model and a given model
  [Slide figure: the data subset [USA, Dec 04](D), with columns Location, Time, Race, Sex, …, Approval, and the model built on it]

  9. Example (5/7): Model-Similarity
  • Given: a data table D, a target model h0(X), and a test set Δ without labels
  • At level [Country, Month]: for each cell, build a model on that cell’s subset of D and measure its similarity to h0 on the test set Δ (one possible measure is sketched below)
  • Example finding: the loan decision process in the USA during Dec 04 was similar to a discriminatory decision model
  [Slide figure: the data table D, the per-cell similarity computation against h0(X) on the test set Δ, and the resulting cube of similarity values]
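A minimal sketch of one similarity measure consistent with this slide: the fraction of the unlabeled test set on which the cell's model and the target model h0 make the same prediction. The paper may use other measures as well; everything here is illustrative.

```python
# Hypothetical similarity measure: prediction agreement on the unlabeled
# test set between a cell's model and the given target model h0.
import numpy as np

def model_similarity(h_cell, h0, test_X):
    """Higher agreement => the cell's decision process looks more like h0."""
    return float(np.mean(h_cell.predict(test_X) == h0.predict(test_X)))
```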

  10. Example (6/7): Predictiveness
  • Given: a data table D, a set of attributes V, and a test set Δ without labels
  • At level [Country, Month]: for each cell, build models h(X) and h(X − V) on that cell’s subset; the predictiveness of V is measured by how much their predictions differ on Δ (sketched below)
  • Example finding: Race was an important predictor of the loan approval decision in the USA during Dec 04
  [Slide figure: building h(X) and h(X − V) per cell and comparing their predictions on the test set Δ]
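A sketch of the predictiveness idea under stated assumptions: `make_model()` is assumed to return a fresh classifier (or pipeline) that can fit the raw columns, and predictiveness is taken here as the prediction disagreement between the full model and the model with V removed; the paper's exact measure may differ.

```python
# Hypothetical predictiveness measure for attributes V within one cell:
# disagreement between h(X) and h(X - V) on the unlabeled test set.
import numpy as np

def predictiveness(V, train, predictors, target, test_X, make_model):
    full = make_model().fit(train[list(predictors)], train[target])
    reduced_cols = [c for c in predictors if c not in V]
    reduced = make_model().fit(train[reduced_cols], train[target])
    disagree = (full.predict(test_X[list(predictors)]) !=
                reduced.predict(test_X[reduced_cols]))
    return float(np.mean(disagree))
```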

  11. Example (7/7): Prediction Cube
  • Cell value: predictiveness of Race
  • Roll up to coarser cells (e.g., country by year); drill down to finer cells (e.g., state/province by month)
  [Slide figure: the resulting prediction cube of predictiveness-of-Race values by Location and Time]

  12. Efficient Computation
  • Reduce prediction cube computation to data cube computation
  • Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be applied directly

  13. Bottom-Up Data Cube Computation
  [Slide figure: bottom-up data cube computation; cell values are numbers of loan applications]

  14. Functions on Sets
  • Bottom-up computable functions: functions that can be computed using only summary information
  • Distributive function: f(X) = F({f(X1), …, f(Xn)}), where X = X1 ∪ … ∪ Xn and Xi ∩ Xj = ∅
    • E.g., Count(X) = Sum({Count(X1), …, Count(Xn)})
  • Algebraic function: f(X) = F({G(X1), …, G(Xn)}), where G(Xi) returns a fixed-length vector of values
    • E.g., Avg(X) = F({G(X1), …, G(Xn)}) with G(Xi) = [Sum(Xi), Count(Xi)] and F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci}) (both examples are sketched below)
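A runnable sketch of the slide's own two examples: Count is distributive (children's counts simply add), and Avg is algebraic via the fixed-length summary G(Xi) = [Sum(Xi), Count(Xi)].

```python
# Distributive: Count(X) = Sum({Count(X1), ..., Count(Xn)})
def count_merge(child_counts):
    return sum(child_counts)

# Algebraic: Avg via the fixed-length summary G(Xi) = [Sum(Xi), Count(Xi)]
def avg_summary(values):
    return [sum(values), len(values)]

def avg_merge(child_summaries):
    total = sum(s for s, _ in child_summaries)
    count = sum(c for _, c in child_summaries)
    return total / count

# Three disjoint base cells roll up to one parent cell.
cells = [[1, 2, 3], [4, 5], [6]]
assert count_merge(len(c) for c in cells) == 6
assert avg_merge([avg_summary(c) for c in cells]) == 21 / 6  # 3.5
```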

  15. Scoring Function
  • Represent a model as a function of sets
  • Conceptually, a machine-learning model h(X; Z(D)) is a scoring function Score(y, x; Z(D)) that gives each class y a score on test example x (a tiny sketch follows below)
    • h(x; Z(D)) = argmax_y Score(y, x; Z(D))
    • Score(y, x; Z(D)) ≈ p(y | x, Z(D))
    • Z(D): the set of training examples (a cube subset of D)
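A tiny sketch of this scoring-function view; `score` stands in for any concrete Score(y, x; Z(D)) and the toy score in the usage example is an illustration, not a function from the paper.

```python
# h(x; Z(D)) = argmax_y Score(y, x; Z(D)): prediction is the class whose
# score is highest on the given cube subset.
def predict(x, subset, classes, score):
    return max(classes, key=lambda y: score(y, x, subset))

# Usage with a toy score that ignores x and just uses class frequency
# in the subset (a crude stand-in for p(y | x, subset)):
toy_score = lambda y, x, subset: sum(1 for r in subset if r["y"] == y) / len(subset)
subset = [{"y": "Yes"}, {"y": "Yes"}, {"y": "No"}]
assert predict({}, subset, ["Yes", "No"], toy_score) == "Yes"
```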

  16. Bottom-up Score Computation
  • Key observations:
    • Observation 1: Score(y, x; Z(D)) is a function of the cube subset Z(D); if it is distributive or algebraic, the bottom-up data cube technique can be applied directly
    • Observation 2: having the scores for all the test examples and all the cells is sufficient to compute a prediction cube
  • Scores → predictions → cell values
  • The details depend on what each cell means (i.e., the type of prediction cube), but they are straightforward

  17. Machine-Learning Models
  • Naïve Bayes: scoring function is algebraic (sketched below)
  • Kernel-density-based classifier: scoring function is distributive
  • Decision tree, random forest: neither distributive nor algebraic
  • PBE: probability-based ensemble (new)
    • Makes any machine-learning model distributive
    • An approximation
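A minimal sketch of why a Naïve Bayes scoring function can be computed bottom-up, as claimed above: its sufficient statistics are counts, child cells' counts simply add, and a parent cell's score can then be computed from the merged summary without touching the raw records again. The data layout and the crude +1 smoothing are illustrative assumptions.

```python
# Hypothetical bottom-up Naive Bayes: per-cell count summaries that merge by
# addition, plus a score computed from a merged summary.
from collections import Counter

def nb_stats(rows, predictors, target):
    """Per-cell summary: class counts and per-class (attribute, value) counts."""
    class_counts, attr_counts = Counter(), Counter()
    for r in rows:
        y = r[target]
        class_counts[y] += 1
        for a in predictors:
            attr_counts[(y, a, r[a])] += 1
    return class_counts, attr_counts

def nb_merge(child_stats):
    """Roll up: the parent's summary is just the sum of the children's."""
    class_counts, attr_counts = Counter(), Counter()
    for cc, ac in child_stats:
        class_counts.update(cc)
        attr_counts.update(ac)
    return class_counts, attr_counts

def nb_score(y, x, stats, predictors):
    """Unnormalized score ~ p(y) * prod_a p(x_a | y), with crude +1 smoothing."""
    class_counts, attr_counts = stats
    score = class_counts[y] / sum(class_counts.values())
    for a in predictors:
        score *= (attr_counts[(y, a, x[a])] + 1) / (class_counts[y] + 1)
    return score
```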

  18. Probability-Based Ensemble
  [Slide figure: the PBE version of the decision tree on [WA, 85], built from decision trees trained on the lowest-level cells, compared with a decision tree trained directly on [WA, 85]]

  19. Probability-Based Ensemble
  • Scoring function: combine the models built on the base (lowest-level) subsets bi(D) that make up Z(D), weighting each by how likely x is to belong to that subset, i.e., Score(y, x; Z(D)) = Σi g(bi | x) · h(y | x; bi(D)) (sketched below)
  • h(y | x; bi(D)): model h’s estimate of p(y | x, bi(D))
  • g(bi | x): a model that predicts the probability that x belongs to base subset bi(D)
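A hedged sketch of how these two components could be combined, following the slide's reading (the paper's exact weighting may differ in detail): each base cell's model contributes its class-probability estimate, weighted by the probability that x belongs to that base subset.

```python
# Hypothetical PBE scoring: mix the base-cell models h(y | x; b_i(D)),
# weighting each by g(b_i | x).
def pbe_score(y, x, base_models, g):
    """base_models[b](y, x) ~ p(y | x, b(D)); g(b, x) ~ p(x belongs to b(D))."""
    return sum(g(b, x) * base_models[b](y, x) for b in base_models)

def pbe_predict(x, classes, base_models, g):
    return max(classes, key=lambda y: pbe_score(y, x, base_models, g))
```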

  20. Outline
  • Motivating example
  • Definition of prediction cubes
  • Efficient prediction cube materialization
  • Experimental results
  • Conclusion

  21. Experiments
  • Quality of PBE on 8 UCI datasets: the quality of the PBE version of a model is slightly worse (0 ~ 6%) than the quality of the model trained directly on the whole training data
  • Efficiency of the bottom-up score computation technique
  • Case study on demographic data
  [Slide figure: PBE built from the lowest-level cells (e.g., [WA, 1985]) vs. the model trained directly on the whole data]

  22. Efficiency of Bottom-up Score Computation
  • Machine-learning models:
    • J48: J48 decision tree
    • RF: random forest
    • NB: Naïve Bayes
    • KDC: kernel-density-based classifier
  • Bottom-up method vs. exhaustive method:
    • PBE-J48 vs. J48ex
    • PBE-RF vs. RFex
    • NB vs. NBex
    • KDC vs. KDCex

  23. Synthetic Dataset
  • Dimensions: Z1, Z2, and Z3
  • Decision rule: defined in terms of Z1, Z2, and Z3 (shown as a figure on the slide)

  24. Efficiency Comparison
  [Slide figure: execution time (sec) vs. number of records, comparing the exhaustive method with bottom-up score computation]

  25. Related Work: Building Models on OLAP Results
  • Multi-dimensional regression [Chen, VLDB 02]
    • Goal: detect changes of trends
    • Build linear regression models for cube cells
  • Step-by-step regression in stream cubes [Liu, PAKDD 03]
  • Loglinear-based quasi cubes [Barbara, J. IIS 01]
    • Use loglinear models to approximately compress dense regions of a data cube
  • NetCube [Margaritis, VLDB 01]
    • Build a Bayes net on the entire dataset to approximately answer count queries

  26. Related Work (Contd.)
  • Cubegrades [Imielinski, J. DMKD 02]
    • Extend cubes with ideas from association rules
    • How does the measure change when we roll up or drill down?
  • Constrained gradients [Dong, VLDB 01]
    • Find pairs of similar cell characteristics associated with big changes in the measure
  • User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
    • Help users find the most informative unvisited regions in a data cube using the maximum entropy principle
  • Multi-Structural DBs [Fagin et al., PODS 05, VLDB 05]

  27. Take-Home Messages
  • A promising exploratory data analysis paradigm:
    • Models can be used to identify interesting subsets
    • Concentrate only on subsets in cube space; those are meaningful subsets, and the analysis stays tractable
    • Precompute results and provide users with an interactive tool
  • A simple way to plug “something” into cube-style analysis: try to describe/approximate that “something” with a distributive or algebraic function

  28. Big Picture
  • Why stop with decision behavior? The approach can be applied to other kinds of analyses too
  • Why stop at browsing? Prediction cubes can be mined in their own right
  • Exploratory analysis of the mining space: dimension attributes can be parameters related to the algorithm, data conditioning, etc.
  • Tractable evaluation is a challenge: a large number of “dimensions”, real-valued dimension attributes, and difficulties in compositional evaluation
  • Possible directions: active learning for experiment design, extending compositional methods

  29. Community Information Management (CIM)
  Anhai Doan, University of Illinois at Urbana-Champaign
  Raghu Ramakrishnan, University of Wisconsin-Madison

  30. Structured Web-Queries
  • Example queries:
    • How many alumni are top-10 faculty members? (Wisconsin does very well, by the way)
    • Find trends in publications: by topic, by conference, by alumni of schools
    • Change tracking: alert me if my co-authors publish new papers or move to new jobs
  • Information is extracted from text sources on the web, then queried

  31. Key Ideas
  • Communities are ideally scoped chunks of the web for which to build enhanced portals
    • Relative uniformity in content and interests
    • Can exploit “people power” via mass collaboration to augment extraction
  • CIM platform: facilitate collaborative creation and maintenance of community portals
    • Extraction management: uncertainty, provenance, maintenance, compositional inference for refining extracted information
    • Mass collaboration for extraction and integration
  • Watch for new DBWorld!

  32. Challenges
  • User interaction:
    • Declarative specification of background knowledge and user feedback
    • Intelligent prompting for user input
    • Explanation of results

  33. Challenges
  • Extraction and query plans:
    • Starting from user input (ER schema, hints) and background knowledge (e.g., standard types, look-up tables), compile a query into an execution plan
    • Must cover extraction, storage and indexing, and relational processing (and maintenance!)
    • An algebra to represent such plans? A query optimizer?
    • Handling uncertainty, constraints, conflicts, multiple related sources, ranking, and a modular architecture

  34. Challenges
  • Managing extracted data:
    • Mapping between extracted metadata and source data
    • Uncertainty of the mapping
    • Conflicts (in user input, background knowledge, or from multiple sources)
    • Evolution over time
