Prediction Cubes

Prediction Cubes Bee-Chung Chen, Lei Chen, Yi Lin and Raghu Ramakrishnan University of Wisconsin - Madison

Subset Mining • We want to find interesting subsets of the dataset • Interestingness: Defined by the “model” built on a subset • Cube space: A combination of dimension attribute values defines a candidate subset (just like regular OLAP) • We want the measures to represent decision/prediction behavior • Summarize a subset using the “model” built on it • Big change from regular OLAP!

The Idea • Build OLAP data cubes in which cell values represent decision/prediction behavior • In effect, build a tree for each cell/region in the cube—observe that this is not the same as a collection of trees used in an ensemble method! • The idea is simple, but it leads to promising data mining tools • Ultimate objective: Exploratory analysis of the entire space of “data mining choices” • Choice of algorithms, data conditioning parameters …

Example (1/7): Regular OLAP Location Time Z: Dimensions Y: Measure Goal: Look for patterns of unusually high numbers of applications:

Example (2/7): Regular OLAP 04 03 … Coarser regions CA 100 90 … USA 80 90 … … … … … 2004 … Jan … Dec … Roll up CA AB 20 15 15 … Drill down … 5 2 20 … 2004 2003 … YT 5 3 15 … Jan … Dec Jan … Dec … USA AL 55 … … … CA 30 20 50 25 30 … … … 5 … … USA 70 2 8 10 … … … WY 10 … … … … … … … … … … … … … … … … … Cell value: Number of loan applications Goal: Look for patterns of unusually high numbers of applications: Z: Dimensions Y: Measure Finer regions

Example (3/7): Decision Analysis Cube subset Location Time Race Sex … Approval Model h(X, Z(D)) E.g., decision tree AL, USA Dec, 04 White M … Yes … … … … … … WY, USA Dec, 04 Black F … No Location Time Goal: Analyze a bank’s loan decision processw.r.t. two dimensions: Location and Time Fact table D Z: Dimensions X: Predictors Y: Class

Example (3/7): Decision Analysis • Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)? • Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?” • Are there branches and times with decision making reminiscent of 1950s Alabama? • Requires comparison of classifiers trained using different subsets of data.

Example (4/7): Prediction Cubes Data [USA, Dec 04](D) Location Time Race Sex … Approval AL ,USA Dec, 04 White M … Y … … … … … … WY, USA Dec, 04 Black F … N Model h(X, [USA, Dec 04](D)) E.g., decision tree • Build a model using data from USA in Dec., 1985 • Evaluate that model • Measure in a cell: • Accuracy of the model • Predictiveness of Race • measured based on that • model • Similarity between that • model and a given model

Example (5/7): Model-Similarity Data table D Location Time Race Sex … Approval AL, USA Dec, 04 White M … Yes … … … … … … WY, USA Dec, 04 Black F … No 2004 2003 … Jan … Dec Jan … Dec … CA 0.4 0.2 0.3 0.6 0.5 … … Build a model USA 0.2 0.3 0.9 … … … … … … … … … … … Similarity Race Sex … Level: [Country, Month] Yes Yes White F … … … … … … No Yes Black M … h0(X) Test set  Given: - Data table D - Target model h0(X) - Test set  w/o labels The loan decision process in USA during Dec 04 was similar to a discriminatory decision model

Example (6/7): Predictiveness Given: - Data table D - Attributes V - Test set  w/o labels Data table D Yes No . . Yes Yes No . . No Build models h(XV) h(X) Level: [Country, Month] Predictiveness of V Race was an important predictor of loan approval decision in USA during Dec 04 Test set 

Example (7/7): Prediction Cube Roll up 04 03 … CA 0.3 0.2 … USA 0.2 0.3 … … … … … 2004 2003 … Jan … Dec Jan … Dec … CA AB 0.4 0.2 0.1 0.1 0.2 … … … 0.1 0.1 0.3 0.3 … … … YT 0.3 0.2 0.1 0.2 … … … USA AL 0.2 0.1 0.2 … … … … Drill down … 0.3 0.1 0.1 … … … WY 0.9 0.7 0.8 … … … … … … … … … … … … … Cell value: Predictiveness of Race

Efficient Computation • Reduce prediction cube computation to data cube computation • Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied

Bottom-Up Data Cube Computation Cell Values: Numbers of loan applications

Functions on Sets • Bottom-up computable functions: Functions that can be computed using only summary information • Distributive function: (X) = F({(X1), …, (Xn)}) • X = X1 …  Xn and Xi  Xj = • E.g., Count(X) = Sum({Count(X1), …, Count(Xn)}) • Algebraic function: (X) = F({G(X1), …, G(Xn)}) • G(Xi) returns a length-fixed vector of values • E.g., Avg(X) = F({G(X1), …, G(Xn)}) • G(Xi) = [Sum(Xi), Count(Xi)] • F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci})

Scoring Function • Represent a model as a function of sets • Conceptually, a machine-learning model h(X; Z(D)) is a scoring function Score(y, x; Z(D)) that gives each class y a score on test example x • h(x; Z(D)) = argmax y Score(y, x; Z(D)) • Score(y, x; Z(D))  p(y | x, Z(D)) • Z(D): The set of training examples (a cube subset of D)

Bottom-up Score Computation • Key observations: • Observation 1:Score(y, x; Z(D)) is a function of cube subset Z(D); if it is distributive or algebraic, the data cube bottom-up technique can be directly applied • Observation 2: Having the scores for all the test examples and all the cells is sufficient to compute a prediction cube • Scores  predictions cell values • Details depend on what each cell means (i.e., type of prediction cubes); but straightforward

Machine-Learning Models • Naïve Bayes: • Scoring function: algebraic • Kernel-density-based classifier: • Scoring function: distributive • Decision tree, random forest: • Neither distributive, nor algebraic • PBE: Probability-based ensemble (new) • To make any machine-learning model distributive • Approximation

Probability-Based Ensemble PBE version of decision tree on [WA, 85] Decision tree on [WA, 85] Decision trees built on the lowest-level cells

Probability-Based Ensemble • Scoring function: • h(y | x; bi(D)): Model h’s estimation of p(y | x, bi(D)) • g(bi | x): A model that predicts the probability that x belongs to base subset bi(D)

Outline • Motivating example • Definition of prediction cubes • Efficient prediction cube materialization • Experimental results • Conclusion

Experiments 1985 1985 … … WA … WA … … … … … • Quality of PBE on 8 UCI datasets • The quality of the PBE version of a model is slightly worse (0 ~ 6%) than the quality of the model trained directly on the whole training data. • Efficiency of the bottom-up score computation technique • Case study on demographic data PBE vs.

Efficiency of Bottom-up Score Computation • Machine-learning models: • J48: J48 decision tree • RF: Random forest • NB: Naïve Bayes • KDC: Kernel-density-based classifier • Bottom-up method vs. Exhaustive method • PBE-J48 • PBE-RF • NB • KDC •  J48ex • RFex • NBex • KDCex

Synthetic Dataset • Dimensions: Z1, Z2 and Z3. • Decision rule: Z1 and Z2 Z3

Efficiency Comparison Using exhaustive method Execution Time (sec) Using bottom-up score computation # of Records

Related Work: Building models on OLAP Results • Multi-dimensional regression [Chen, VLDB 02] • Goal: Detect changes of trends • Build linear regression models for cube cells • Step-by-step regression in stream cubes [Liu, PAKDD 03] • Loglinear-based quasi cubes [Barbara, J. IIS 01] • Use loglinear model to approximately compress dense regions of a data cube • NetCube [Margaritis, VLDB 01] • Build Bayes Net on the entire dataset of approximate answer count queries

Related Work (Contd.) • Cubegrades [Imielinski, J. DMKD 02] • Extend cubes with ideas from association rules • How does the measure change when we rollup or drill down? • Constrained gradients [Dong, VLDB 01] • Find pairs of similar cell characteristics associated with big changes in measure • User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01] • Help users find the most informative unvisited regions in a data cube using max entropy principle • Multi-Structural DBs [Fagin et al., PODS 05, VLDB 05]

Take-Home Messages • Promising exploratory data analysis paradigm: • Can use models to identify interesting subsets • Concentrate only on subsets in cube space • Those are meaningful subsets, tractable • Precompute results and provide the users with an interactive tool • A simple way to plug “something” into cube-style analysis: • Try to describe/approximate “something” by a distributive or algebraic function

Big Picture • Why stop with decision behavior? Can apply to other kinds of analyses too • Why stop at browsing? Can mine prediction cubes in their own right • Exploratory analysis of mining space: • Dimension attributes can be parameters related to algorithm, data conditioning, etc. • Tractable evaluation is a challenge: • Large number of “dimensions”, real-valued dimension attributes, difficulties in compositional evaluation • Active learning for experiment design, extending compositional methods

Community Information Management (CIM) UI Anhai Doan University of Illinois at Urbana-Champaign Raghu Ramakrishnan University of Wisconsin-Madison

Structured Web-Queries UI • Example Queries: • How many alumni are top-10 faculty members? • Wisconsin does very well, by the way • Find trends in publications • By topic, by conference, by alumni of schools • Change tracking • Alert me if my co-authors publish new papers or move to new jobs • Information is extracted from text sources on the web, then queried

Key Ideas UI • Communities are ideally scoped chunks of the web for which to build enhanced portals • Relative uniformity in content, interests • Can exploit “people power” via mass collaboration, to augment extraction • CIM platform: Facilitate collaborative creation and maintenance of community portals • Extraction management • Uncertainty, provenance, maintenance, compositional inference for refining extracted information • Mass collaboration for extraction and integration Watch for new DBWorld!

Challenges UI • User Interaction • Declarative specification of background knowledge and user feedback • Intelligent prompting for user input • Explanation of results

Challenges UI • Extraction and Query Plans • Starting from user input (ER schema, hints) and background knowledge (e.g., standard types, look-up tables), compile a query into an execution plan • Must cover extraction, storage and indexing, and relational processing • And maintenance! • Algebra to represent such plans? Query optimizer? • Handling uncertainty, constraints, conflicts, multiple related sources, ranking, modular architecture

Challenges UI • Managing extracted data • Mapping between extracted metadata and source data • Uncertainty of mapping • Conflicts (in user input, background knowledge, or from multiple sources) • Evolution over time

Prediction Cubes